Abstract

Aiming at solving large-scale optimization problems, this paper studies distributed optimization 
methods based on the alternating direction method of multipliers (ADMM). By formulating the optimization
problem as a consensus problem, the ADMM can be used to solve the consensus problem in a
fully parallel fashion over a computer network with a star topology. However, traditional synchronized 
computation does not scale well with the problem size, as the speed of the algorithm is limited by 
the slowest workers. This is particularly true in a heterogeneous network where the computing nodes 
experience different computation and communication delays. In this paper, we propose an asynchronous 
distributed ADMM (AD-ADMM) which can effectively improve the time efficiency of distributed optimization.
Our main interest lies in analyzing the convergence conditions of the AD-ADMM, under
the popular partially asynchronous model, which is defined based on a maximum tolerable delay of the 
network. Specifically, by considering general and possibly non-convex cost functions, we show that the 
AD-ADMM is guaranteed to converge to the set of Karush-Kuhn-Tucker (KKT) points as long as the 
algorithm parameters are chosen appropriately according to the network delay. We further illustrate that 
the asynchrony of the ADMM has to be handled with care, as slightly modifying the implementation of 
the AD-ADMM can jeopardize the algorithm convergence, even under the standard convex setting. 
I. Introduction

A. Background 

Scaling up optimization algorithms for future data-intensive applications calls for efficient distributed 
and parallel implementations, so that modern multi-core high performance computing technologies can 
be fully utilized [2]-[4]. In this work, we are interested in developing distributed optimization methods 
for solving the following optimization problem 

$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{N} f_i(x) + h(x), \tag{1}$$

where each $f_i : \mathbb{R}^n \to \mathbb{R}$ is a (smooth) cost function; $h : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a convex (proper and lower
semi-continuous) but possibly non-smooth regularization function. The latter is used to impose desired
structures on the solution (e.g., sparsity) and/or used to enforce certain constraints. Problem (1) includes 
as special cases many important statistical learning problems such as the LASSO problem [5], logistic 
regression (LR) problem [6], support vector machine (SVM) [7] and the sparse principal component 
analysis (PCA) problem [8]. In this paper, we focus on solving large-scale instances of these learning 
problems with either a large number of training samples or a large number of features (n is large) [3]. 
These are typical data-intensive machine learning scenarios in which the data sets are often distributed
across a few computing nodes. Traditional centralized optimization methods, therefore, fail to scale
well due to their inability to handle distributed data sets and computing resources.

Our goal is to develop efficient distributed optimization algorithms over a computer network with a 
star topology, in which a master node coordinates the computation of a set of distributed workers (see 
Figure 1 for illustration). Such a star topology represents a common architecture for distributed computing
and has therefore been widely used in distributed optimization [4], [9]-[16]. For example, under the star
topology, references [10], [11] presented distributed stochastic gradient descent (SGD) methods, references 
[12], [13] parallelized the proximal gradient (PG) methods, while references [14]-[17] parallelized the 
block coordinate descent (BCD) method. In these works, the distributed workers iteratively calculate 
the gradients related to their local data, while the master collects such information from the workers to 
perform SGD, PG or BCD updates. 

However, when scaling up these distributed algorithms, node synchronization becomes an important
issue. Specifically, under the synchronous protocol, the master is triggered at each iteration only if it
receives the required information from all the distributed workers. On the one hand, such synchronization
helps keep the algorithms well behaved; on the other hand, the speed of the algorithms
would be limited by the “slowest” worker, especially when the workers have different computation
and communication delays. To address this dilemma, a few recent works [10]-[14] have introduced
“asynchrony” into the distributed algorithms, which allows the master to perform updates when not all,
but only a small subset of, the workers have returned their gradient information. The asynchronous updates
cause “delayed” gradient information. A few algorithmic tricks such as delay-dependent step-size selection
have been introduced to ensure that the stale gradient information does not destroy the stability of the
algorithm. In practice, such asynchrony does make a big difference. As has been consistently reported in
[10]-[14], under such an asynchronous protocol, the computation time can decrease almost linearly with
the number of workers.

Fig. 1: A star computer cluster with one master and N workers.

B. Related Works 

A different approach for distributed and parallel optimization is based on the alternating direction 
method of multipliers (ADMM) [9, Section 7.1.1]. In the distributed ADMM, the original learning problem 
is partitioned into N subproblems, each containing a subset of training samples or the learning parameters. 
At each iteration, the workers solve the subproblems and send the up-to-date variable information to the 
master, who summarizes this information and broadcasts the result to the workers. In this way, a given 
large-scale learning problem can be solved in a parallel and distributed fashion. Notably, other than the 
standard convex setting [9], the recent analysis in [18] has shown that such distributed ADMM is provably 
convergent to a Karush-Kuhn-Tucker (KKT) point even for non-convex problems. 

Recently, the synchronous distributed ADMM [9], [18] has been extended to the asynchronous setting, 
similar to [10]-[14]. Specifically, reference [19] has considered a version of AD-ADMM with bounded 
delay assumption and studied its theoretical and numerical performances. However, only convex cases are 
considered in [19]. Reference [20] has studied another version of AD-ADMM for non-convex problems, 
which considers inexact subproblem updates and, similar to [10]-[14], the workers compute gradient 
information only. This type of distributed optimization scheme, however, may not fully utilize the
computational power of the distributed nodes. Besides, due to the inexact updates, such schemes usually require
more iterations to converge and thus may have higher communication overhead. References [21]-[23]
have considered asynchronous ADMM methods for decentralized optimization over networks.
These works consider network topologies beyond the star network, but their definitions of asynchrony are
different from what we propose here. Specifically, the asynchrony in [21] lies in that, at each iteration,
the nodes are randomly activated to perform variable updates. The method presented in [22] further
allows the communications between nodes to succeed or fail randomly. It is shown in [22]
that such an asynchronous ADMM can converge in a probability-one sense, provided that the nodes and
communication links satisfy certain statistical assumptions. Reference [23] has considered an asynchronous
dual ADMM method. The asynchrony is in the sense that the nodes are partitioned into groups based on
a certain coloring scheme and only one group of nodes updates its variables in each iteration.

C. Contributions 

In this paper^1, we generalize the state-of-the-art synchronous distributed ADMM [9], [18] to the
asynchronous setting. Like [10]-[14], [19], [20], the asynchronous distributed ADMM (AD-ADMM) 
algorithm developed in this paper gives the master the freedom of making updates only based on variable 
information from a partial set of workers, which further improves the computation efficiency of the 
distributed ADMM. 

Theoretically, we show that, for general and possibly non-convex problems in the form of (1), the 
AD-ADMM converges to the set of KKT points if the algorithm parameters are chosen appropriately 
according to the maximum network delay. Our results differ significantly from the existing works [19], 
[21], [22] which are all developed for convex problems. Therefore, the analysis and algorithm proposed 
here are applicable not only to standard convex learning problems but also to important non-convex 
problems such as the sparse PCA problem [8] and matrix factorization problems [24]. To the best of
our knowledge, except the inexact version in [20], this is the first time that the distributed ADMM is 
rigorously shown to be convergent for non-convex problems under the asynchronous protocol. Moreover, 
unlike [19], [21], [22] where the convergence analyses all rely on certain statistical assumptions on the
nodes/workers, our convergence analysis is deterministic and characterizes the worst-case convergence
conditions of the AD-ADMM under a bounded delay assumption only. Furthermore, we demonstrate that
the asynchrony of ADMM has to be handled with care, as a slight modification of the algorithm may
lead to completely different convergence conditions and even destroy the convergence of ADMM for
convex problems. Some numerical results are presented to support our theoretical claims.

^1 In contrast to the conference paper [1], the current paper presents detailed proofs of theorems and more simulation results.

In the companion paper [25], the linear convergence conditions of the AD-ADMM are further analyzed.
In addition, the numerical performance of the AD-ADMM is examined by solving a large-scale LR 
problem on a high-performance computer cluster. 

Synopsis: Section II presents the applications of problem (1) and reviews the distributed ADMM in 
[9]. The proposed AD-ADMM and its convergence conditions are presented in Section III. Comparison 
of the proposed AD-ADMM with an alternative scheme is presented in Section IV. Some simulation 
results are presented in Section V. Finally, concluding remarks are given in Section VI. 

II. Applications and Distributed ADMM 

A. Applications 

We aim to solve problem (1) over a star computer network (cluster) with one master node and N
workers/slaves, as illustrated in Figure 1. Such a distributed optimization approach is extremely useful in
modern big data applications [3]. For example, let us consider the following regularized empirical risk
minimization problem [7]

$$\min_{w \in \mathbb{R}^n} \; \sum_{j=1}^{m} \ell(a_j^T w, y_j) + \Omega(w), \tag{2}$$

where $m$ is the number of training samples and $\ell(a_j^T w, y_j)$ is a loss function (e.g., regression or
classification error) that depends on the training sample $a_j \in \mathbb{R}^n$, label $y_j$ and the parameter vector
$w \in \mathbb{R}^n$. Here, $n$ denotes the dimension of the parameters (features); $\Omega(w)$ is an appropriate convex
regularizer. Problem (2) is one of the most important problems in signal processing and statistical learning,
which includes the LASSO problem [26], LR [6], SVM [7] and the sparse PCA problem [8], to name
a few. Obviously, solving (2) can be challenging when the number of training samples is very large. In
that case, it is natural to split the training samples across the computer cluster and resort to a distributed
optimization approach. Suppose that the $m$ training samples are uniformly distributed and stored by the $N$
workers, with each node $i$ getting $q_i = \lfloor m/N \rfloor$ samples. By defining $f_i(w)$ as the sum of the losses
$\ell(a_j^T w, y_j)$ over the $q_i$ samples stored at worker $i$,
$i = 1, \ldots, N$, and $h(w) = \Omega(w)$, it is clear that (2) is an instance of (1).
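The sample-splitting construction above can be illustrated with a minimal sketch. It assumes a squared loss and that $N$ divides $m$; the helper names `squared_loss`, `split_samples` and `local_cost` are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: squared loss and helper names are assumptions,
# not part of the paper. It checks that the local costs f_i sum to the
# total empirical loss in (2) when the samples are split evenly.

def squared_loss(a, w, y):
    """ell(a^T w, y) = 0.5 * (a^T w - y)^2 for a single sample."""
    pred = sum(aj * wj for aj, wj in zip(a, w))
    return 0.5 * (pred - y) ** 2

def split_samples(m, num_workers):
    """Give each worker q = floor(m / N) consecutive sample indices.
    Assumes N divides m, so no sample is left unassigned."""
    q = m // num_workers
    return [list(range(i * q, (i + 1) * q)) for i in range(num_workers)]

def local_cost(indices, samples, labels, w):
    """f_i(w): sum of the losses over the samples held by one worker."""
    return sum(squared_loss(samples[j], w, labels[j]) for j in indices)

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
           [2.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
labels = [1.0, -1.0, 0.5, 2.0, 0.0, 1.5]
w = [0.3, -0.2]

blocks = split_samples(len(samples), 3)
local_costs = [local_cost(block, samples, labels, w) for block in blocks]
total_cost = local_cost(range(len(samples)), samples, labels, w)
```

Each worker only needs its own block of samples to evaluate (or differentiate) its $f_i$, which is what makes the consensus reformulation used later natural.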

When the number of training samples is moderate but the dimension of the parameters is very large
($n \gg m$), problem (2) is also challenging to solve. By [9, Section 7.3], one can instead consider the
Lagrangian dual problem of (2) provided that (2) has zero duality gap. Specifically, let the training
matrix $A = [a_1, \ldots, a_m]^T \in \mathbb{R}^{m \times n}$ be partitioned as $A = [A_1, \ldots, A_N]$, and let the parameter vector
$w$ be partitioned conformally as $w = [w_1^T, \ldots, w_N^T]^T$; moreover, assume that $\Omega$ is separable as $\Omega(w) = \sum_{i=1}^{N} \Omega_i(w_i)$.
Then, following [9, Section 7.3], one can obtain the dual problem of (2) as

$$\min_{\nu \in \mathbb{R}^m} \; \sum_{i=1}^{N} \Omega_i^*(A_i^T \nu) + \ell^*(\nu), \tag{3}$$

where $\nu = [\nu_1, \ldots, \nu_m]^T$ is a dual variable, and $\ell^*$ and $\Omega_i^*$ are respectively
the conjugate functions of $\ell$ and $\Omega_i$. Note that (3) is equivalent to splitting the $n$ parameters across the
$N$ workers. Clearly, problem (3) is an instance of (1).

It is interesting to mention that many emerging problems in the smart power grid can also be formulated
as problem (1); for example, the power state estimation problem considered in [27] is solved by
employing the distributed ADMM. The energy management problems (i.e., demand response) in [28]-[30]
can potentially be handled by the distributed ADMM as well.

B. Distributed ADMM 

In this section, we present the distributed ADMM [4], [9] for solving problem (1). Let us consider the 
following consensus formulation of problem (1) 



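$$\min_{x_0,\, x_i \in \mathbb{R}^n,\; i = 1, \ldots, N} \; \sum_{i=1}^{N} f_i(x_i) + h(x_0) \tag{4a}$$

$$\text{s.t.} \quad x_i = x_0 \quad \forall\, i = 1, \ldots, N. \tag{4b}$$

In (4), the $N + 1$ variables $x_i$, $i = 0, 1, \ldots, N$, are subject to the consensus constraint in (4b), i.e.,
$x_0 = x_1 = \cdots = x_N$. Thus, problem (4) is equivalent to (1).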

It has been shown that such a consensus problem can be efficiently solved by the ADMM [9]. To
describe this method, let $\lambda_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, denote the Lagrange dual variables associated with constraint (4b) and
define the following augmented Lagrangian function

$$\mathcal{L}_\rho(x, x_0, \lambda) = \sum_{i=1}^{N} f_i(x_i) + h(x_0) + \sum_{i=1}^{N} \lambda_i^T (x_i - x_0) + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i - x_0\|^2, \tag{5}$$

where $x = [x_1^T, \ldots, x_N^T]^T$, $\lambda = [\lambda_1^T, \ldots, \lambda_N^T]^T$ and $\rho > 0$ is a penalty parameter. According to [4], the
standard synchronous ADMM iteratively updates the primal variables $x_i$, $i = 0, 1, \ldots, N$, by minimizing
(5) in a (one-round) Gauss-Seidel fashion, followed by updating the dual variable $\lambda$ using an approximate
gradient ascent method. The ADMM algorithm for solving (4) is presented in Algorithm 1.
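Algorithm 1 (Synchronous) Distributed ADMM for (4) [9]

1: Given initial variables $x^0$ and $\lambda^0$; set $x_0^0 = x^0$ and $k = 0$.
2: repeat
3:   update

     $$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \left\{ h(x_0) - x_0^T \sum_{i=1}^{N} \lambda_i^k + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^k - x_0\|^2 \right\}, \tag{6}$$

     $$x_i^{k+1} = \arg\min_{x_i \in \mathbb{R}^n} \; f_i(x_i) + x_i^T \lambda_i^k + \frac{\rho}{2} \|x_i - x_0^{k+1}\|^2, \quad \forall\, i = 1, \ldots, N, \tag{7}$$

     $$\lambda_i^{k+1} = \lambda_i^k + \rho (x_i^{k+1} - x_0^{k+1}), \quad \forall\, i = 1, \ldots, N. \tag{8}$$

4:   set $k \leftarrow k + 1$.
5: until a predefined stopping criterion is satisfied.
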

As seen, Algorithm 1 is naturally implementable over the star computer network illustrated in Figure
1. Specifically, the master node takes charge of optimizing $x_0$ by (6), and each worker $i$ is responsible
for optimizing $(x_i, \lambda_i)$ by (7)-(8). Through exchanging the up-to-date $x_0$ and $(x_i, \lambda_i)$ between the
master and the workers, Algorithm 1 solves problem (1) in a fully distributed and parallel manner.
Convergence properties of the distributed ADMM have been extensively studied; see, e.g., [9], [18],
[31]-[33]. Specifically, [31] shows that the ADMM, under general convex assumptions, has a worst-case
$O(1/k)$ convergence rate; while [32] shows that the ADMM can have a linear convergence rate given
strong convexity and smoothness conditions on $f_i$’s. For non-convex and smooth $f_i$’s, the work [18]
shows that Algorithm 1 can converge to the set of KKT points with an $O(1/\sqrt{k})$ rate as long as $\rho$ is large
enough.
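To make the update order (6)-(8) concrete, the following is a minimal single-process sketch of Algorithm 1. It is an illustration under stated assumptions rather than the authors' implementation: the variable is scalar ($n = 1$), each local cost is taken as $f_i(x) = \frac{1}{2}(a_i x - b_i)^2$, and $h(x) = \theta |x|$, so that both subproblems have closed forms. The function names and the data are hypothetical.

```python
# Illustrative sketch only: scalar toy problem with assumed costs
# f_i(x) = 0.5 * (a_i * x - b_i)^2 and h(x) = theta * |x|.

def soft_threshold(v, t):
    """Proximal operator of t * |.| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sync_admm(a, b, theta, rho=1.0, iters=200):
    """Run the synchronous updates (6)-(8) in a single process."""
    n_workers = len(a)
    x = [0.0] * n_workers      # worker copies x_i
    lam = [0.0] * n_workers    # dual variables lambda_i
    x0 = 0.0                   # master variable x_0
    for _ in range(iters):
        # (6): the master minimizes over x_0; for h = theta*|.| this is
        # a soft-thresholding of the averaged (lambda_i + rho * x_i).
        v = sum(l + rho * xi for l, xi in zip(lam, x)) / (n_workers * rho)
        x0 = soft_threshold(v, theta / (n_workers * rho))
        for i in range(n_workers):
            # (7): closed-form minimizer of the quadratic local subproblem.
            x[i] = (a[i] * b[i] - lam[i] + rho * x0) / (a[i] ** 2 + rho)
            # (8): dual ascent step.
            lam[i] += rho * (x[i] - x0)
    return x0, x, lam

a = [1.0, 2.0, 0.5]
b = [1.0, -0.5, 2.0]
theta = 0.3

x0_admm, x_admm, lam_admm = sync_admm(a, b, theta)

# Closed-form minimizer of sum_i f_i(x) + theta*|x| for this scalar problem.
x_closed_form = soft_threshold(sum(ai * bi for ai, bi in zip(a, b)), theta) \
    / sum(ai ** 2 for ai in a)
```

In this toy setting the master variable and all worker copies can be compared against `x_closed_form`; in a real deployment the inner loop over `i` would run on separate workers, and the master would block until all of them report back.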

However, Algorithm 1 is a synchronous algorithm, where the operations of the master and the workers
are “locked” with each other. Specifically, to optimize $x_0$ at each iteration, the master has to wait until
receiving all the up-to-date variables $(x_i, \lambda_i)$, $i = 1, \ldots, N$, from the workers. Since the workers may
have different computation and communication delays^2, the pace of the optimization would be determined
by the “slowest” worker. As an example illustrated in Figure 2(a), the master updates $x_0$ only when it
has received the variable information from all four workers at every iteration. As a result, under such a
synchronous protocol, the master and speedy workers (e.g., workers 1 and 3 in Figure 2) would spend
most of the time idling, and thus the parallel computational resources cannot be fully utilized.


III. Asynchronous Distributed ADMM 

A. Algorithm Description 

In this section, we present an AD-ADMM. The asynchronism we consider is in the same spirit as
[10]-[14], [19], [20], where the master does not wait for all the workers. Instead, the master updates $x_0$
whenever it receives $(x_i, \lambda_i)$ from a partial set of the workers. For example, in Figure 2(b), the master
updates $x_0$ whenever it receives the variable information from at least two workers. This implies that
none of the workers have to be synchronized with each other and the master does not need to wait for the
slowest worker either. As illustrated in Figure 2(b), with the lock removed, both the master and speedy
workers can update their variables more frequently.



Fig. 2: Illustration of synchronous and asynchronous distributed ADMM: (a) synchronous distributed ADMM; (b) asynchronous distributed ADMM. The timelines show the computation delay of the workers, the communication delay, and the computation delay of the master.

Let us denote $k \geq 0$ as the iteration number of the master (i.e., the number of times for which the
master updates $x_0$), and denote $\mathcal{A}_k \subseteq \mathcal{V} = \{1, \ldots, N\}$ as the index subset of workers from which the
master receives variable information during iteration $k$ (for example, in Figure 2(b), $\mathcal{A}_0 = \{1, 3\}$ and
$\mathcal{A}_1 = \{1, 2\}$)^3. We say that worker $i$ is “arrived” at iteration $k$ if $i \in \mathcal{A}_k$ and “unarrived” otherwise.
Clearly, unbounded delay will jeopardize the algorithm convergence. Therefore, throughout this paper, we
will assume that the asynchronous delay in the network is bounded. In particular, we follow the popular
partially asynchronous model [4] and assume:

^2 In a heterogeneous network, the workers can have different computational powers, or the data sets can be non-uniformly
distributed across the network. Thus, the workers can require different computational times in solving the local subproblems.
Besides, the communication delays can also be different, e.g., due to probabilistic communication failures and message
retransmission.

Assumption 1 (Bounded delay) Let $\tau \geq 1$ be a maximum tolerable delay. For all $i \in \mathcal{V}$ and iteration
$k \geq 0$, it must be that $i \in \mathcal{A}_k \cup \mathcal{A}_{k-1} \cup \cdots \cup \mathcal{A}_{\max\{k-\tau+1, -1\}}$.

Assumption 1 implies that every worker $i$ is arrived at least once within the period $[k - \tau + 1, k]$.
In other words, the variable information $(x_i, \lambda_i)$ used by the master must be at most $\tau$ iterations old.
To guarantee the bounded delay, at every iteration the master should wait for the workers who have
been inactive for $\tau - 1$ iterations, if such workers exist. Note that, when $\tau = 1$, one has $i \in \mathcal{A}_k$ for all
$i \in \mathcal{V}$ (i.e., $\mathcal{A}_k = \mathcal{V}$), which corresponds to the synchronous case and the master always waits for all the
workers at every iteration.
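As a small illustration of Assumption 1, the sketch below checks whether a recorded sequence of arrival sets $\mathcal{A}_0, \mathcal{A}_1, \ldots$ respects a given maximum delay $\tau$, using the convention $\mathcal{A}_{-1} = \mathcal{V}$. The function name `satisfies_bounded_delay` is hypothetical, and the example reuses the two arrival sets quoted from Figure 2(b) with four workers.

```python
# Illustrative sketch only: a checker for the bounded-delay condition.
# Workers are labelled 1..N, and A_{-1} is taken to be the full set V.

def satisfies_bounded_delay(arrivals, num_workers, tau):
    """Return True if every worker appears in A_k, A_{k-1}, ...,
    A_{max(k - tau + 1, -1)} for every recorded iteration k."""
    all_workers = set(range(1, num_workers + 1))

    def arrived_set(j):
        return all_workers if j == -1 else arrivals[j]

    for k in range(len(arrivals)):
        window_start = max(k - tau + 1, -1)
        seen = set()
        for j in range(window_start, k + 1):
            seen |= arrived_set(j)
        if not all_workers <= seen:
            return False
    return True

# Arrival sets A_0 = {1, 3} and A_1 = {1, 2} with N = 4 workers.
example_arrivals = [{1, 3}, {1, 2}]
ok_tau3 = satisfies_bounded_delay(example_arrivals, 4, tau=3)
ok_tau2 = satisfies_bounded_delay(example_arrivals, 4, tau=2)
ok_tau1 = satisfies_bounded_delay(example_arrivals, 4, tau=1)
```

With $\tau = 1$ the check only passes when every arrival set equals $\mathcal{V}$, which matches the synchronous case described above.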

In Algorithm 2, we present the proposed AD-ADMM, which specifies respectively the steps for the
master and the distributed workers. Here, $\mathcal{A}_k^c$ denotes the complementary set of $\mathcal{A}_k$, i.e., $\mathcal{A}_k \cap \mathcal{A}_k^c = \emptyset$
and $\mathcal{A}_k \cup \mathcal{A}_k^c = \mathcal{V}$. Algorithm 2 has five notable differences compared with Algorithm 1. First, the
master is required to update $\{(x_i, \lambda_i)\}_{i \in \mathcal{V}}$, and such update is only performed for those variables with
$i \in \mathcal{A}_k$. Second, $x_0$ is updated by solving a problem with an additional proximal term $\frac{\gamma}{2}\|x_0 - x_0^k\|^2$,
where $\gamma > 0$ is a penalty parameter (cf. (12)). Adding such a proximal term is crucial in making the
algorithm well-behaved in the asynchronous setting. As will be seen in the next section, a proper choice
of $\gamma$ guarantees the convergence of Algorithm 2. Third, the variables $d_i$’s are introduced to count the
delays of the workers. If worker $i$ is arrived at the current iteration, then $d_i$ is set to zero; otherwise, $d_i$
is increased by one. So, to ensure that Assumption 1 holds all the time, in Step 4 of Algorithm of the Master,
the master waits if there exists at least one worker whose $d_i \geq \tau - 1$. Fourth, in addition to the bounded
delay, we assume that the master proceeds to update the variables only if there are at least $A \geq 1$ arrived
workers, i.e., $|\mathcal{A}_k| \geq A$ for all $k$ [19]. Note that when $A = N$, the algorithm reduces to the synchronous
distributed ADMM. Fifth, in Step 6 of Algorithm of the Master, the master sends the up-to-date $x_0$ only
to the arrived workers.
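The waiting rule and the delay counters described above can be sketched as follows. This is only an illustration of the bookkeeping, assuming the counters are stored in a dictionary keyed by worker index; the names `master_may_update` and `update_delays` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: bookkeeping for the delay counters d_i and the
# master's waiting rule (wait for overdue workers and for at least
# `min_arrivals` arrived workers).

def master_may_update(delays, arrived, tau, min_arrivals):
    """True if the master can proceed: enough workers have arrived and
    every worker with d_i >= tau - 1 is among them."""
    overdue = {i for i, d in delays.items() if d >= tau - 1}
    return len(arrived) >= min_arrivals and overdue <= arrived

def update_delays(delays, arrived):
    """Reset d_i to zero for arrived workers, increase it by one otherwise."""
    return {i: 0 if i in arrived else d + 1 for i, d in delays.items()}

delays = {1: 0, 2: 1, 3: 0, 4: 2}

# With tau = 3, worker 4 (d_4 = 2) is overdue, so the master must wait for it.
blocked = master_may_update(delays, {1, 3}, tau=3, min_arrivals=2)
allowed = master_may_update(delays, {1, 4}, tau=3, min_arrivals=2)
next_delays = update_delays(delays, {1, 4})
```

Setting `tau=1` makes every worker overdue at every iteration, so the rule degenerates to waiting for all workers, in line with the synchronous case.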

We emphasize again that both the master and fast workers in the AD-ADMM can have less idle
time and update more frequently than their synchronous counterparts. As illustrated in Figure 2, during the
same period of time, the synchronous algorithm only completes two updates whereas the asynchronous
algorithm updates six times already. On the flip side, the asynchronous algorithm introduces delayed
variable information and thereby requires a larger number of iterations to reach the same solution accuracy
than its synchronous counterpart. In practice we observe that the benefit of improved update frequency
can outweigh the cost of increased number of iterations, and as a result the asynchronous algorithm can
still converge faster in time. This is particularly true when the workers have different computation and
communication delays and when the computation and communication delays of the master for solving
(12) are much shorter than the computation and communication delays of the workers for updating (13)
and (14)^4; e.g., see Figure 2. Detailed numerical results will be reported in Section V of the companion
paper [25].

^3 Without loss of generality, we let $\mathcal{A}_{-1} = \mathcal{V}$, as seen from Figure 2.

B. Convergence Analysis 

In this subsection, we analyze the convergence conditions of Algorithm 2. We first make the following 
standard assumption on problem (1) (or equivalently problem (4)): 

Assumption 2 Each function $f_i$ is twice differentiable and its gradient $\nabla f_i$ is Lipschitz continuous with
a Lipschitz constant $L > 0$; the function $h$ is proper convex (lower semi-continuous, but not necessarily
smooth) and dom($h$) (the domain of $h$) is compact. Moreover, problem (1) is bounded below, i.e., $F^* >
-\infty$ where $F^*$ denotes the optimal objective value of problem (1).

Notably, we do not assume any convexity on $f_i$’s. Indeed, we will show that the AD-ADMM can converge
to the set of KKT points even for non-convex $f_i$’s. Our main result is formally stated below.

Theorem 1 Suppose that Assumption 1 and Assumption 2 hold true. Moreover, assume that there exists
a constant $S \in [1, N]$ such that $|\mathcal{A}_k| \leq S$ for all $k$ and that

$$\infty > \mathcal{L}_\rho(x^0, x_0^0, \lambda^0) - F^* \geq 0, \tag{15}$$

$$\rho > \frac{(1 + L + L^2) + \sqrt{(1 + L + L^2)^2 + 8L^2}}{2}, \tag{16}$$

$$\gamma > \frac{S(1 + \rho^2)(\tau - 1)^2 - N\rho}{2}. \tag{17}$$

Then, $(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ generated by (9), (10) and (12) are bounded and have limit points which
satisfy KKT conditions of problem (4).

^4 Note that, for many practical cases (such as $h(x_0) = \|x_0\|_1$) for which (12) has a closed-form solution, the computation delay
of the master is negligible. For high-performance computer clusters connected by large-bandwidth fiber links, the communication
delays between the master and the workers can also be short. However, for cases in which the computation and communication
delays of the master are significant, the AD-ADMM could be less time efficient than the synchronous ADMM due to the increased
number of iterations.

Theorem 1 implies that the AD-ADMM is guaranteed to converge to the set of KKT points as long as
the penalty parameters $\rho$ and $\gamma$ are sufficiently large. Since $1/\gamma$ can be viewed as the step size of $x_0$,
(17) indicates that the master should be more cautious in moving $x_0$ if the network allows a longer delay
$\tau$. In particular, the value of $\gamma$ in the worst case should increase with the order of $\tau^2$. When $\tau = 1$ (the
synchronous case), the right-hand side of (17) reduces to $-N\rho/2 < 0$ and thus the proximal term $\frac{\gamma}{2}\|x_0 - x_0^k\|^2$ can be removed from
(12). On the other hand, we also see from (17) that $\gamma$ should increase with $N$ if $\tau > 1$ is fixed^5. This is
because in the worst case the more workers, the more outdated information introduced in the network.
Finally, we should mention that a large $\rho$ may be essential for the AD-ADMM to converge properly,
especially for non-convex problems, as we demonstrate via simulations in Section V.
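To give a feel for how the parameter thresholds scale, the following sketch evaluates the right-hand sides of (16) and (17) as read above. The function names `rho_lower_bound` and `gamma_lower_bound` are hypothetical, and the numerical values ($L = 1$, $N = S = 4$, $\rho = 3.6$) are arbitrary illustrative choices rather than settings from the paper.

```python
import math

# Illustrative sketch only: evaluates the thresholds in (16) and (17).
# The bounds are worst-case sufficient conditions, so practical choices
# of rho and gamma may well be smaller.

def rho_lower_bound(lipschitz):
    """Right-hand side of (16) for a gradient Lipschitz constant L."""
    c = 1.0 + lipschitz + lipschitz ** 2
    return 0.5 * (c + math.sqrt(c * c + 8.0 * lipschitz ** 2))

def gamma_lower_bound(max_arrivals, rho, tau, num_workers):
    """Right-hand side of (17): (S * (1 + rho^2) * (tau - 1)^2 - N * rho) / 2."""
    return 0.5 * (max_arrivals * (1.0 + rho ** 2) * (tau - 1) ** 2
                  - num_workers * rho)

rho_min = rho_lower_bound(1.0)          # threshold on rho for L = 1
rho = 3.6                               # any value above rho_min
gamma_sync = gamma_lower_bound(4, rho, tau=1, num_workers=4)
gamma_tau2 = gamma_lower_bound(4, rho, tau=2, num_workers=4)
gamma_tau3 = gamma_lower_bound(4, rho, tau=3, num_workers=4)
```

For $\tau = 1$ the threshold on $\gamma$ is negative, so no proximal term is needed, while it grows roughly fourfold from $\tau = 2$ to $\tau = 3$, reflecting the $\tau^2$ dependence discussed above.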

Let us compare Theorem 1 with the results in [19], [22]. First, the convergence conditions in [19],
[22] are only applicable for convex problems, whereas our results hold for both convex and non-convex
problems. Second, [19], [22] have made specific statistical assumptions on the behavior of the workers,
and the convergence results presented therein are in an expectation sense. Therefore it is possible, at least 
theoretically, that a realization of the algorithm fails to converge despite satisfying the conditions given 
in [19]. On the contrary, our convergence results hold deterministically. 

Note that for non-convex $f_i$’s, subproblem (13) is not necessarily convex. However, given $\rho > L$
by (16) and twice differentiability of $f_i$ (Assumption 2), subproblem (13) becomes a (strongly) convex
problem^6 and hence is globally solvable. When $f_i$’s are all convex functions, Theorem 1 reduces to the
following corollary.

Corollary 1 Assume that $f_i$’s are all convex functions. Under the same premises of Theorem 1, and for
$\gamma$ satisfying (17) and

$$\rho > \frac{(1 + L^2) + \sqrt{(1 + L^2)^2 + 8L^2}}{2}, \tag{18}$$

$(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ generated by (9), (10) and (12) are bounded and have limit points which satisfy
KKT conditions of problem (4).

^5 Note that, for a fixed $\tau$, $S$ should increase with $N$.

^6 By [34, Lemma 1.2.2], the minimum eigenvalue of the Hessian matrix of $f_i(x_i)$ is no smaller than $-L$. Thus, for $\rho > L$,
subproblem (13) is a strongly convex problem.
C. Proof of Theorem 1 and Corollary 1

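Let us write Algorithm 2 from the master’s point of view. Define $k_i$ as the last iteration number before
iteration $k$ for which worker $i \in \mathcal{A}_k$ is arrived^7, i.e., $i \in \mathcal{A}_{k_i}$. Then Algorithm 2 from the master’s point
of view is as follows: for master iteration $k = 0, 1, \ldots$,

$$x_i^{k+1} = \begin{cases} \arg\min_{x_i \in \mathbb{R}^n} \left\{ f_i(x_i) + x_i^T \lambda_i^{k_i+1} + \frac{\rho}{2} \|x_i - x_0^{k_i+1}\|^2 \right\} & \forall\, i \in \mathcal{A}_k, \\ x_i^k & \forall\, i \in \mathcal{A}_k^c, \end{cases} \tag{19}$$

$$\lambda_i^{k+1} = \begin{cases} \lambda_i^{k_i+1} + \rho\, (x_i^{k+1} - x_0^{k_i+1}) & \forall\, i \in \mathcal{A}_k, \\ \lambda_i^k & \forall\, i \in \mathcal{A}_k^c, \end{cases} \tag{20}$$

$$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \left\{ h(x_0) - x_0^T \sum_{i=1}^{N} \lambda_i^{k+1} + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2} \|x_0 - x_0^k\|^2 \right\}.$$

Now it is relatively easy to see that the master updates $x_0$ using the delayed $(x_i, \lambda_i)_{i \in \mathcal{A}_k}$ and the old
$(x_i, \lambda_i)_{i \in \mathcal{A}_k^c}$. Under Assumption 1, it must hold

$$\max\{k - \tau, -1\} \leq k_i < k \quad \forall\, k \geq 0. \tag{21}$$

Moreover, by the definition of $k_i$ it holds that $i \notin \mathcal{A}_{k-1} \cup \cdots \cup \mathcal{A}_{k_i+1}$; therefore we have that

$$\lambda_i^{k_i+1} = \lambda_i^{k_i+2} = \cdots = \lambda_i^k, \quad \forall\, i \in \mathcal{A}_k. \tag{22}$$

By applying (22) to (19) and (20) (replacing $\lambda_i^{k_i+1}$ with $\lambda_i^k$), we rewrite the master-point-of-view
algorithm in Algorithm 3.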
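The master-point-of-view recursion can be simulated in a single process. The sketch below is a toy illustration under stated assumptions, not the authors' implementation: the variable is scalar, $f_i(x) = \frac{1}{2}(x - c_i)^2$ (so $L = 1$), $h$ is the indicator of the interval $[-10, 10]$ (so that dom($h$) is compact), and the arrival sets are drawn at random subject to the bounded-delay rule. The function name `simulate_ad_admm`, the random arrival model, and all numerical values are hypothetical.

```python
import random

# Illustrative sketch only: a single-process simulation of the asynchronous
# updates as seen by the master, for an assumed scalar toy problem.

def simulate_ad_admm(c, rho, gamma, tau, min_arrivals, iters, box=10.0, seed=0):
    rng = random.Random(seed)
    n = len(c)
    x = [0.0] * n          # worker variables x_i as known to the master
    lam = [0.0] * n        # dual variables lambda_i as known to the master
    x0 = 0.0               # master variable x_0
    x0_seen = [x0] * n     # last copy of x_0 each worker received
    delay = [0] * n        # delay counters d_i
    for _ in range(iters):
        # Workers that have been inactive for tau - 1 iterations must arrive.
        arrived = {i for i in range(n) if delay[i] >= tau - 1}
        others = [i for i in range(n) if i not in arrived]
        rng.shuffle(others)
        for i in others:
            if len(arrived) < min_arrivals or rng.random() < 0.5:
                arrived.add(i)
        for i in range(n):
            if i in arrived:
                # Worker update computed with the (possibly outdated) x_0.
                x[i] = (c[i] - lam[i] + rho * x0_seen[i]) / (1.0 + rho)
                lam[i] += rho * (x[i] - x0_seen[i])
                delay[i] = 0
            else:
                delay[i] += 1
        # Master update with the proximal term (gamma / 2) * (x_0 - x_0^k)^2;
        # for a scalar box constraint the minimizer is a clipped average.
        v = (sum(lam) + rho * sum(x) + gamma * x0) / (n * rho + gamma)
        x0 = max(-box, min(box, v))
        # Only the arrived workers receive the new x_0.
        for i in arrived:
            x0_seen[i] = x0
    return x0, x, lam

c = [1.0, -2.0, 3.0, 0.5]
x0_async, x_async, lam_async = simulate_ad_admm(
    c, rho=3.6, gamma=21.0, tau=2, min_arrivals=1, iters=5000)
target = sum(c) / len(c)   # minimizer of sum_i 0.5 * (x - c_i)^2
```

The values $\rho = 3.6$ and $\gamma = 21$ are picked to exceed the thresholds of Theorem 1 for $L = 1$, $N = S = 4$ and $\tau = 2$; the consensus value can then be compared with `target`, the average of the $c_i$.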

Inspired by [18], our analysis for Theorem 1 investigates how the augmented Lagrangian function, i.e.,

$$\mathcal{L}_\rho(x^k, x_0^k, \lambda^k) = \sum_{i=1}^{N} f_i(x_i^k) + h(x_0^k) + \sum_{i=1}^{N} (\lambda_i^k)^T (x_i^k - x_0^k) + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^k - x_0^k\|^2 \tag{26}$$

evolves with the iteration number $k$, where $x^k = [(x_1^k)^T, \ldots, (x_N^k)^T]^T$ and $\lambda^k = [(\lambda_1^k)^T, \ldots, (\lambda_N^k)^T]^T$.
The following lemma is one of the keys to prove Theorem 1.

^7 Note that $k_i = -1$ for $k = 0$ and $k_i \geq -1$ for $k > 0$.
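Lemma 1 Suppose that Assumption 2 holds and $\rho > L$. Then, it holds that

$$\begin{aligned} \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^k, x_0^k, \lambda^k) \leq{} & -\frac{2\gamma + N\rho}{2} \|x_0^{k+1} - x_0^k\|^2 + \frac{1 + \rho^2}{2} \sum_{i \in \mathcal{A}_k} \|x_0^k - x_0^{k_i+1}\|^2 \\ & + \left( \frac{1}{\rho} + \frac{1}{2} \right) \sum_{i \in \mathcal{A}_k} \|\lambda_i^{k+1} - \lambda_i^k\|^2 + \frac{(1 - \rho) + L}{2} \sum_{i \in \mathcal{A}_k} \|x_i^{k+1} - x_i^k\|^2. \end{aligned} \tag{27}$$

Proof: See Appendix A. ■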

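Equation (27) shows that $\mathcal{L}_\rho(x^k, x_0^k, \lambda^k)$ is not necessarily decreasing due to the error terms
$\sum_{i \in \mathcal{A}_k} \|\lambda_i^{k+1} - \lambda_i^k\|^2$ and $\sum_{i \in \mathcal{A}_k} \|x_0^k - x_0^{k_i+1}\|^2$. Next we bound the sizes of these two terms.

First consider $\sum_{i \in \mathcal{A}_k} \|\lambda_i^{k+1} - \lambda_i^k\|^2$. Note from (24) and the optimality condition of (23) that, $\forall\, i \in \mathcal{A}_k$,

$$0 = \nabla f_i(x_i^{k+1}) + \lambda_i^k + \rho\, (x_i^{k+1} - x_0^{k_i+1}) = \nabla f_i(x_i^{k+1}) + \lambda_i^{k+1}. \tag{28}$$

For any $i \in \mathcal{A}_k^c$, denote $\tilde{k}_i < k$ as the last iteration number for which worker $i$ is arrived. Then, $i \in \mathcal{A}_{\tilde{k}_i}$
and thus $\nabla f_i(x_i^{\tilde{k}_i+1}) + \lambda_i^{\tilde{k}_i+1} = 0$. Since $x_i^{\tilde{k}_i+1} = x_i^{\tilde{k}_i+2} = \cdots = x_i^k = x_i^{k+1}$ and $\lambda_i^{\tilde{k}_i+1} = \lambda_i^{\tilde{k}_i+2} =
\cdots = \lambda_i^k = \lambda_i^{k+1}$, we obtain that $\nabla f_i(x_i^{k+1}) + \lambda_i^{k+1} = 0$ $\forall\, i \in \mathcal{A}_k^c$. Therefore, we conclude that

$$\nabla f_i(x_i^{k+1}) + \lambda_i^{k+1} = 0 \quad \forall\, i \in \mathcal{V} \text{ and } \forall\, k. \tag{29}$$

By (29) and the Lipschitz continuity of $\nabla f_i$ (Assumption 2), we can bound

$$\|\lambda_i^{k+1} - \lambda_i^k\|^2 = \|\nabla f_i(x_i^{k+1}) - \nabla f_i(x_i^k)\|^2 \leq L^2 \|x_i^{k+1} - x_i^k\|^2, \quad \forall\, i \in \mathcal{V}. \tag{30}$$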

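By applying (30), we can further write (27) as

$$\begin{aligned} \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) \leq{} & \mathcal{L}_\rho(x^k, x_0^k, \lambda^k) + \frac{1 + \rho^2}{2} \sum_{i \in \mathcal{A}_k} \|x_0^k - x_0^{k_i+1}\|^2 \\ & + \left( \frac{(1 - \rho) + L}{2} + \frac{L^2}{\rho} + \frac{L^2}{2} \right) \sum_{i \in \mathcal{A}_k} \|x_i^{k+1} - x_i^k\|^2 - \frac{2\gamma + N\rho}{2} \|x_0^{k+1} - x_0^k\|^2. \end{aligned} \tag{31}$$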

February 22, 2016 DRAFT


From (31), one can observe that the error term $\frac{1 + \rho^2}{2} \sum_{i \in \mathcal{A}_k} \|x_0^k - x_0^{k_i+1}\|^2$ is present due to the
asynchrony of the network. The next lemma bounds this error term:


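Lemma 2 Suppose that Assumption 1 holds and assume that $|\mathcal{A}_k| \leq S$ for all $k$, for some constant
$S \in [1, N]$. Then, it holds that

$$\sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \|x_0^j - x_0^{j_i+1}\|^2 \leq S (\tau - 1)^2 \sum_{j=0}^{k-1} \|x_0^{j+1} - x_0^j\|^2. \tag{32}$$

Proof: See Appendix B. ■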

The last lemma shows that $\mathcal{L}_\rho(x^k, x_0^k, \lambda^k)$ is bounded below:

Lemma 3 Under Assumption 2 and for $\rho > L$, it holds that

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) \geq F^* > -\infty. \tag{33}$$

Proof: See Appendix C. ■

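Given the three lemmas above, we are ready to prove Theorem 1.

Proof of Theorem 1: Note that any KKT point $(\{x_i^*\}_{i=1}^N, x_0^*, \{\lambda_i^*\}_{i=1}^N)$ of problem (4) satisfies the
following conditions

$$\nabla f_i(x_i^*) + \lambda_i^* = 0, \quad \forall\, i \in \mathcal{V}, \tag{34a}$$

$$s_0^* - \sum_{i=1}^{N} \lambda_i^* = 0, \tag{34b}$$

$$x_i^* = x_0^*, \quad \forall\, i \in \mathcal{V}, \tag{34c}$$

where $s_0^* \in \partial h(x_0^*)$ denotes a subgradient of $h$ at $x_0^*$ and $\partial h(x_0^*)$ is the subdifferential of $h$ at $x_0^*$. Since
(34) also implies

$$\sum_{i=1}^{N} \nabla f_i(x^*) + s_0^* = 0, \tag{35}$$

where $x^* = x_0^* = x_1^* = \cdots = x_N^*$, $x^*$ is also a stationary point of the original problem (1).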
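As a quick numerical companion to the KKT conditions (34), the sketch below evaluates the three residuals for a scalar toy problem with the assumed choices $f_i(x) = \frac{1}{2}(a_i x - b_i)^2$ and $h(x) = \theta |x|$, for which the subdifferential of $h$ is explicit. The function name `kkt_residuals` and the data are hypothetical.

```python
# Illustrative sketch only: residuals of (34a)-(34c) for an assumed scalar
# problem with h(x) = theta * |x|.

def kkt_residuals(x, x0, lam, grads, theta):
    """Return (stationarity, subgradient, consensus) residuals.

    stationarity: max_i |grad f_i(x_i) + lambda_i|             (34a)
    subgradient:  distance of sum_i lambda_i to dh(x_0)        (34b)
    consensus:    max_i |x_i - x_0|                            (34c)
    """
    stationarity = max(abs(g(xi) + li) for g, xi, li in zip(grads, x, lam))
    s = sum(lam)
    if x0 > 0:
        subgradient = abs(s - theta)       # dh(x_0) = {theta}
    elif x0 < 0:
        subgradient = abs(s + theta)       # dh(x_0) = {-theta}
    else:
        subgradient = max(0.0, abs(s) - theta)   # dh(0) = [-theta, theta]
    consensus = max(abs(xi - x0) for xi in x)
    return stationarity, subgradient, consensus

a = [1.0, 2.0, 0.5]
b = [1.0, -0.5, 2.0]
theta = 0.3
grads = [lambda z, ai=ai, bi=bi: ai * (ai * z - bi) for ai, bi in zip(a, b)]

# Known minimizer of sum_i f_i(x) + theta*|x| for these data (it is positive).
x_star = (sum(ai * bi for ai, bi in zip(a, b)) - theta) / sum(ai ** 2 for ai in a)
lam_star = [-g(x_star) for g in grads]

res_opt = kkt_residuals([x_star] * 3, x_star, lam_star, grads, theta)
res_off = kkt_residuals([x_star + 0.1] * 3, x_star, lam_star, grads, theta)
```

At the minimizer all three residuals vanish up to rounding, while a perturbed point yields strictly positive stationarity and consensus residuals; such residuals are one possible stopping criterion.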
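To prove the desired result, we take a telescoping sum of (31), which yields

$$\begin{aligned} \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^0, x_0^0, \lambda^0) \leq{} & \frac{1 + \rho^2}{2} \sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \|x_0^j - x_0^{j_i+1}\|^2 \\ & + \left( \frac{(1 - \rho) + L}{2} + \frac{L^2}{\rho} + \frac{L^2}{2} \right) \sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \|x_i^{j+1} - x_i^j\|^2 \\ & - \frac{2\gamma + N\rho}{2} \sum_{j=0}^{k} \|x_0^{j+1} - x_0^j\|^2. \end{aligned} \tag{36}$$

By substituting (32) in Lemma 2 into (36), we obtain

$$\begin{aligned} & \frac{2\gamma + N\rho - S(1 + \rho^2)(\tau - 1)^2}{2} \sum_{j=0}^{k} \|x_0^{j+1} - x_0^j\|^2 + \left( \frac{(\rho - 1) - (L + L^2)}{2} - \frac{L^2}{\rho} \right) \sum_{j=0}^{k} \sum_{i=1}^{N} \|x_i^{j+1} - x_i^j\|^2 \\ & \quad \leq \mathcal{L}_\rho(x^0, x_0^0, \lambda^0) - \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) \\ & \quad = \left( \mathcal{L}_\rho(x^0, x_0^0, \lambda^0) - F^* \right) - \left( \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - F^* \right) \\ & \quad \leq \mathcal{L}_\rho(x^0, x_0^0, \lambda^0) - F^* < \infty, \end{aligned} \tag{37}$$

where the second inequality is obtained by applying Lemma 3, and the last strict inequality is due to
Assumption 2 where the optimal value $F^*$ is assumed to be lower bounded.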

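Then, (16) and (17) imply that the left hand side (LHS) of (37) is positive and increasing with $k$. Since
the RHS of (37) is finite, we must have, as $k \to \infty$,

$$x_0^{k+1} - x_0^k \to 0, \quad x_i^{k+1} - x_i^k \to 0, \quad \forall\, i \in \mathcal{V}. \tag{38}$$

Given (30), (38) implies

$$\lambda_i^{k+1} - \lambda_i^k \to 0, \quad \forall\, i \in \mathcal{V}. \tag{39}$$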

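We use (38) and (39) to show that every limit point of $(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ is a KKT point of
problem (4). Firstly, by applying (39) to (24) and by (38), one obtains $x_i^{k+1} - x_0^{k+1} \to 0$ $\forall\, i \in \mathcal{A}_k$. For
$i \in \mathcal{A}_k^c$, let $\tilde{k}_i < k$ be the last iteration number for which worker $i$ is arrived (see the definition above (29)); then $i \in \mathcal{A}_{\tilde{k}_i}$ and thus, by (24),

$$\lambda_i^{\tilde{k}_i+1} = \lambda_i^{\tilde{k}_i} + \rho\, (x_i^{\tilde{k}_i+1} - x_0^{\hat{k}_i+1}),$$

where $\hat{k}_i$ denotes the last iteration number before iteration $\tilde{k}_i$ for which worker $i$ is arrived. Moreover,
since $x_i^{\tilde{k}_i+1} = \cdots = x_i^k = x_i^{k+1}$ $\forall\, i \in \mathcal{A}_k^c$, and by (24), (38) and (39), we have $\forall\, i \in \mathcal{A}_k^c$,

$$\begin{aligned} \|x_0^{k+1} - x_i^{k+1}\| &= \|x_0^{k+1} - x_i^{\tilde{k}_i+1}\| \\ &\leq \|x_0^{k+1} - x_0^{\hat{k}_i+1}\| + \|x_0^{\hat{k}_i+1} - x_i^{\tilde{k}_i+1}\| \\ &= \|x_0^{k+1} - x_0^{\hat{k}_i+1}\| + \frac{1}{\rho} \|\lambda_i^{\tilde{k}_i+1} - \lambda_i^{\tilde{k}_i}\| \to 0. \end{aligned} \tag{40}$$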


So we conclude 


a;^+^ - x\-^^ ^ 0 Vi G V. (41) 

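Secondly, the optimality condition of (25) gives

$$s_0^{k+1} - \sum_{i=1}^N \lambda_i^{k+1} - \rho \sum_{i=1}^N \bigl(x_i^{k+1} - x_0^{k+1}\bigr) + \gamma\,\bigl(x_0^{k+1} - x_0^{k}\bigr) = 0, \tag{42}$$

for some $s_0^{k+1} \in \partial h(x_0^{k+1})$. By applying (41) and (38) to (42), we obtain that

$$s_0^{k+1} - \sum_{i=1}^N \lambda_i^{k+1} \to 0. \tag{43}$$

Equations (29), (41) and (43) imply that $(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ asymptotically satisfies the KKT conditions in (34).

Lastly, let us show that $(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ is bounded and has limit points. Since $\mathrm{dom}(h)$ is compact and $x_0^k \in \mathrm{dom}(h)$, $x_0^k$ is a bounded sequence and thus has limit points. From (41), $x_i^k$, $i \in \mathcal{V}$, are bounded and have limit points. Moreover, by (29), $\lambda_i^k$, $i \in \mathcal{V}$, are bounded and have limit points as well. In summary, $(\{x_i^k\}_{i=1}^N, x_0^k, \{\lambda_i^k\}_{i=1}^N)$ converges to the set of KKT points of problem (4). ■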

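Proof of Corollary 1: The proof exactly follows that of Theorem 1. The only difference is that the coefficient of the term $\|x_i^{k+1} - x_i^{k}\|^2$ in (27) changes, because for convex $f_i$ the convexity parameter of $\mathcal{L}_i$ is $\rho$ rather than $\rho - L$; see the footnote in Appendix A. ■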


IV. Comparison with an Alternative Scheme 

In Algorithm 2, the workers compute $(x_i, \lambda_i)$, $i \in \mathcal{V}$, and the master is in charge of computing $x_0$. While such a distributed implementation is intuitive and natural, one may wonder whether there exist other valid implementations, and if so, how they compare with Algorithm 2. To shed some light on this question, we consider in this section an alternative scheme in Algorithm 4.

Algorithm 4 differs from Algorithm 2 in that the master handles not only the update of $x_0$ but also that of $\{\lambda_i\}_{i \in \mathcal{V}}$, so the workers only update $\{x_i\}$. In essence, in a synchronous network, Algorithm 2 and Algorithm 4 are equivalent up to a change of update order (see the footnote at the end of this section) and have the same convergence conditions. However, intriguingly, in an asynchronous network, the two algorithms may require distinct convergence conditions and behave very differently in practice. To analyze the convergence of Algorithm 4, we make the following assumption.

Assumption 3 Each function $f_i$ is strongly convex with modulus $\sigma^2 > 0$ and the function $h$ is convex.

Under the strong convexity assumption, we are able to show the following convergence result for 
Algorithm 4. 


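Theorem 2 Suppose that Assumption 1 and Assumption 3 hold true. Moreover, let $\gamma = 0$ and

$$0 < \rho \le \frac{\sigma^2}{(5\tau - 3)\max\{2\tau, 3(\tau - 1)\}}, \tag{48}$$

and define $\bar{x}_i^k = \frac{1}{k}\sum_{\ell=1}^{k} x_i^{\ell}$, $i = 0, 1, \ldots, N$, where $(\{x_i^k\}_{i=1}^N, x_0^k)$ are generated by (44) and (45). Then, it holds that

$$\left| \left[ \sum_{i=1}^N f_i(\bar{x}_i^k) + h(\bar{x}_0^k) \right] - F^* \right| + \sum_{i=1}^N \|\bar{x}_i^k - \bar{x}_0^k\| \le \frac{(2 + \delta_\lambda)\,C}{k} \tag{49}$$

for all $k$, where $C < \infty$ is a finite constant and $\delta_\lambda = \max\{\|\lambda_1^*\|, \ldots, \|\lambda_N^*\|\}$, in which $\{\lambda_i^*\}$ denote the optimal dual variables of (4).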


The proof is presented in Appendix D. Theorem 2 suggests that Algorithm 4 may require stronger convergence conditions than Algorithm 2 in the asynchronous network, as the $f_i$'s are assumed to be strongly convex. Besides, unlike Theorem 1, where $\rho$ is advised to be large for Algorithm 2, Theorem 2 indicates that $\rho$ needs to be small for Algorithm 4. Since $\rho$ is the step size of the dual gradient ascent in (46), (48) implies that the master should move the $\lambda_i$'s slowly when $\tau$ is large. Such insight is reminiscent of the recent convergence results for multi-block ADMM in [33].
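To get a feel for how quickly the admissible dual step size in (48) shrinks with the delay bound, the following minimal sketch evaluates the right-hand side of (48). The function name `max_dual_step` is hypothetical, and the bound is taken as $\sigma^2 / ((5\tau-3)\max\{2\tau, 3(\tau-1)\})$, which is how (48) is read here.

```python
def max_dual_step(sigma_sq: float, tau: int) -> float:
    """Upper bound on rho suggested by (48) for Algorithm 4.

    sigma_sq : strong convexity modulus sigma^2 of the f_i (Assumption 3)
    tau      : maximum tolerable delay (tau = 1 is the synchronous case)
    """
    if tau < 1:
        raise ValueError("tau must be a positive integer")
    return sigma_sq / ((5 * tau - 3) * max(2 * tau, 3 * (tau - 1)))


if __name__ == "__main__":
    # The bound decays roughly like 1 / tau^2 as the delay grows.
    for tau in (1, 3, 10):
        print(tau, max_dual_step(1.0, tau))
```

For $\sigma^2 = 1$ the bound is $1/4$ at $\tau = 1$ and $1/72$ at $\tau = 3$, which is in line with the observation that the master should move the dual variables more slowly as $\tau$ grows.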

Interestingly and surprisingly, our numerical results to be presented shortly suggest that strongly convex $f_i$'s and a small $\rho$ are necessary for the convergence of Algorithm 4.


Footnote: Algorithm 2 under the synchronous protocol is the same as Algorithm 1 with the order of (6) and (7) interchanged.

V. Simulation Results

The main purpose of this section is to examine the convergence behavior of the AD-ADMM with 
respect to the master’s iteration number k. So, the simulation results to be presented are obtained by 
implementing Algorithm 3 on a desktop computer. First, we present the simulation results of the AD- 
ADMM for solving the non-convex sparse PCA problem. Second, we consider the LASSO problem and 
compare Algorithm 4 with Algorithm 2. 


A. Example 1: Sparse PCA 


Theorem 1 has shown that the AD-ADMM can converge for non-convex problems. To verify this point, let us consider the following sparse PCA problem [8]

$$\min_{w} \; -\sum_{j=1}^{N} w^T B_j^T B_j w + \theta \|w\|_1, \tag{50}$$

where $B_j \in \mathbb{R}^{m \times n}$, $\forall j = 1, \ldots, N$, and $\theta > 0$ is a regularization parameter. The sparse PCA problem above is not a convex problem. We display in Figure 3 the convergence performance of the AD-ADMM for solving (50). In the simulations, each matrix $B_j \in \mathbb{R}^{1000 \times 500}$ is a sparse random matrix with approximately 5000 non-zero entries; $\theta$ is set to 0.1 and $N = 32$. The penalty parameter $\rho$ is set to $\rho = \beta \max_{j=1,\ldots,N} \lambda_{\max}(B_j^T B_j)$ and $\gamma = 0$. To simulate an asynchronous scenario, at each iteration, half of the workers are assumed to arrive independently with probability 0.1, and the other half are assumed to arrive independently with probability 0.8. At each iteration, the master proceeds to update the variables as long as there is at least one arrived worker, i.e., $A = 1$. The accuracy is defined as

$$\text{accuracy} = \frac{|\mathcal{L}_\rho(x^k, x_0^k, \lambda^k) - \hat{F}|}{\hat{F}}, \tag{51}$$

where $\hat{F}$ denotes the optimal objective value for the synchronous case ($\tau = 1$), which is obtained by running the distributed ADMM (with $\beta = 3$) for 10000 iterations (it is found in the experiments that the AD-ADMM converges to the same KKT point for different values of $\tau$). One can observe from Figure 3 that the AD-ADMM (with $\beta = 3$) indeed converges properly even though (50) is a non-convex problem.

Interestingly, we note that for the example considered here, the AD-ADMM with $\gamma = 0$ works well for different values of $\tau$, even though Theorem 1 suggests that $\gamma$ should be a larger value in the worst case. However, we do observe from Figure 3 that if one sets $\beta = 1.5$ (i.e., a smaller value of $\rho$), then the AD-ADMM diverges even in the synchronous case ($\tau = 1$). This implies that the claim of a large enough $\rho$ is necessary for the non-convex sparse PCA problem.
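The penalty parameter rule used in this example, $\rho = \beta \max_j \lambda_{\max}(B_j^T B_j)$, can be computed as in the sketch below. The function name `penalty_parameter` is hypothetical, and dense NumPy arrays are assumed for simplicity (the experiment itself uses sparse matrices).

```python
import numpy as np


def penalty_parameter(B_list, beta):
    """Return rho = beta * max_j lambda_max(B_j^T B_j).

    B_list : iterable of 2-D arrays B_j, all with the same number of columns
    beta   : positive scaling factor (beta = 3 and beta = 1.5 in the text)
    """
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order,
    # so the last entry is the largest eigenvalue.
    lam_max = max(np.linalg.eigvalsh(B.T @ B)[-1] for B in B_list)
    return beta * lam_max


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B_list = [rng.standard_normal((20, 5)) for _ in range(4)]
    print(penalty_parameter(B_list, beta=3.0))
```

Since $\lambda_{\max}(B_j^T B_j)$ is the squared spectral norm of $B_j$, `np.linalg.norm(B, 2) ** 2` would give the same value.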
Fig. 3: Convergence curves of the AD-ADMM (Algorithm 2) for solving the sparse PCA problem (50); $N = 32$, $\theta = 0.1$, $\rho = \beta \max_{j=1,\ldots,N} \lambda_{\max}(B_j^T B_j)$ and $\gamma = 0$.


B. Example 2: LASSO 


In this example, we compare the convergence performance of Algorithm 4 with Algorithm 2. We consider the following LASSO problem

$$\min_{w \in \mathbb{R}^n} \; \sum_{i=1}^{N} \|A_i w - b_i\|^2 + \theta \|w\|_1, \tag{52}$$

where $A_i \in \mathbb{R}^{m \times n}$, $b_i \in \mathbb{R}^m$, $i = 1, \ldots, N$, and $\theta > 0$. The elements of the $A_i$'s are randomly generated following the Gaussian distribution with zero mean and unit variance, i.e., $\sim \mathcal{N}(0, 1)$; each $b_i$ is generated by $b_i = A_i w^0 + \nu_i$, where $w^0 \in \mathbb{R}^n$ is an $n \times 1$ sparse random vector with approximately $0.05n$ non-zero entries and $\nu_i$ is a noise vector with entries following $\mathcal{N}(0, 0.01)$. A star network with 16 workers ($N = 16$) is considered. To simulate an asynchronous scenario, at each iteration, half of the workers are assumed to arrive independently with probability 0.1, 4 workers are assumed to arrive independently with probability 0.3, and the remaining 4 workers are assumed to arrive independently with probability 0.8.
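The arrival model described above, combined with the bounded-delay rule of the partially asynchronous model, can be simulated as in the following sketch. The function name `simulate_arrivals` is hypothetical; it is assumed here that a worker is forced to arrive once it has been absent for $\tau - 1$ consecutive master iterations, and that the master redraws when no worker arrives (i.e., $A = 1$).

```python
import numpy as np


def simulate_arrivals(probs, tau, iters, seed=0):
    """Return a boolean array of shape (iters, N); entry [k, i] is True
    if worker i belongs to the arrival set A_k at master iteration k."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    d = np.zeros(probs.size, dtype=int)      # consecutive absences per worker
    history = []
    for _ in range(iters):
        # Workers that already reached the delay limit must arrive now.
        arrived = (d >= tau - 1) | (rng.random(probs.size) < probs)
        while not arrived.any():              # master needs at least one worker
            arrived = rng.random(probs.size) < probs
        d = np.where(arrived, 0, d + 1)       # reset or increment delay counters
        history.append(arrived)
    return np.array(history)


if __name__ == "__main__":
    probs = [0.1] * 8 + [0.3] * 4 + [0.8] * 4   # the N = 16 setting above
    hist = simulate_arrivals(probs, tau=3, iters=1000)
    print("empirical arrival rates:", hist.mean(axis=0).round(2))
```

With $\tau = 1$ every worker arrives at every iteration, so the sketch reduces to the synchronous protocol.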

Figure 4(a) and Figure 4(b) respectively display the convergence curves (accuracy versus iteration number) of Algorithm 2 and Algorithm 4 for solving (52) with $N = 16$, $m = 200$, $n = 100$ and $\theta = 0.1$. The accuracy is defined as

$$\text{accuracy} = \frac{|\mathcal{L}_\rho(x^k, x_0^k, \lambda^k) - F^*|}{F^*}, \tag{53}$$

where $F^*$ denotes the optimal objective value of problem (52). One can see from Figure 4(a) that Algorithm 2 (with $\rho = 500$, $\gamma = 0$) converges well for various values of the delay $\tau$. From Figure 4(b), one can observe that, under the synchronous setting (i.e., $\tau = 1$), Algorithm 4 (with $\rho = 500$) behaves similarly to Algorithm 2 in Figure 4(a). However, under the asynchronous setting of $\tau = 3$, Algorithm 4 (with $\rho = 500$) diverges, as shown in Figure 4(b); Algorithm 4 becomes convergent if one decreases $\rho$ to 10. Analogously, for $\tau = 10$, one has to further reduce $\rho$ to 1 in order for Algorithm 4 to converge. However, the convergence speed of Algorithm 4 with $\rho = 1$ is much slower than that of Algorithm 2 in Figure 4(a).

Fig. 4: Convergence curves of Algorithm 2 and Algorithm 4 for solving the LASSO problem in (52) with $N = 16$, $m = 200$ and $\theta = 0.1$. The parameter $\gamma$ is set to zero. Panels: (a) Algorithm 2, $n = 100$; (b) Algorithm 4, $n = 100$; (c) Algorithm 2, $n = 1000$; (d) Algorithm 4, $n = 1000$.

Figure 4(c) and Figure 4(d) show the comparison results of Algorithm 2 and Algorithm 4 for solving (52) with $n$ increased to 1000. Note that, given $m = 200$ and $n = 1000$, the cost functions $f_i(w_i) = \|A_i w_i - b_i\|^2$ in (52) are no longer strongly convex. One can observe from Figure 4(c) that Algorithm 2 (with $\rho = 500$, $\gamma = 0$) still converges properly for various values of $\tau$. However, as one can see from Figure 4(d), Algorithm 4 always diverges for various values of $\rho$, even when the delay $\tau$ is as small as two. As a result, the strong convexity assumed in Theorem 2 may also be necessary in practice. We conclude from these simulation results that Algorithm 2 significantly outperforms Algorithm 4 in the asynchronous network, even though the two have the same convergence behavior in the synchronous network.


VI. Concluding Remarks 

In this paper, we have proposed the AD-ADMM (Algorithm 2) aiming at solving large-scale instances 
of problem (1) over a star computer network. Under the partially asynchronous model, we have shown 
(in Theorem 1) that the AD-ADMM can deterministically converge to the set of KKT points of problem
(4), even in the absence of convexity of the $f_i$'s. We have also compared the AD-ADMM (Algorithm 2) with
an alternative asynchronous implementation (Algorithm 4), and illustrated the interesting fact that a slight 
modification of the algorithm can significantly change the algorithm convergence conditions/behaviors in 
the asynchronous setting. 

From the presented simulation results, we have observed that the AD-ADMM may exhibit linear 
convergence for some structured instances of problem (1). The conditions under which linear convergence 
can be achieved are presented in the companion paper [25]. Numerical results which demonstrate the 
time efficiency of the proposed AD-ADMM on a high performance computer cluster are also presented 
in [25]. 

Appendix A 
Proof of Lemma 1 

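Notice that

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k}, x_0^{k}, \lambda^{k})$$
$$= \mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k+1})$$
$$\quad + \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k})$$
$$\quad + \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k}) - \mathcal{L}_\rho(x^{k}, x_0^{k}, \lambda^{k}). \tag{A.1}$$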
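We bound the three pairs of differences on the right-hand side (RHS) of (A.1) as follows. Firstly, since $-x_0^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2}\|x_0 - x_0^k\|^2$ in (25) is strongly convex with respect to (w.r.t.) $x_0$ with modulus $\gamma + N\rho$, by [34, Definition 2.1.2], we have

$$-(x_0^k)^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^k\|^2 \ge -(x_0^{k+1})^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^{k+1}\|^2 + \frac{\gamma}{2}\|x_0^{k+1} - x_0^k\|^2$$
$$\quad + \Bigl(-\sum_{i=1}^N \lambda_i^{k+1} - \rho\sum_{i=1}^N (x_i^{k+1} - x_0^{k+1}) + \gamma (x_0^{k+1} - x_0^k)\Bigr)^T (x_0^k - x_0^{k+1}) + \frac{\gamma + N\rho}{2}\|x_0^{k+1} - x_0^k\|^2. \tag{A.2}$$

By the optimality condition of (25) and the convexity of $h$, we respectively have

$$\Bigl(s_0^{k+1} - \sum_{i=1}^N \lambda_i^{k+1} - \rho\sum_{i=1}^N (x_i^{k+1} - x_0^{k+1}) + \gamma (x_0^{k+1} - x_0^k)\Bigr)^T (x_0^k - x_0^{k+1}) \ge 0, \tag{A.3}$$

$$h(x_0^k) \ge h(x_0^{k+1}) + (s_0^{k+1})^T (x_0^k - x_0^{k+1}). \tag{A.4}$$

By subsequently applying (A.3) and (A.4) to (A.2), we obtain

$$h(x_0^k) - (x_0^k)^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^k\|^2 - h(x_0^{k+1}) + (x_0^{k+1})^T \sum_{i=1}^N \lambda_i^{k+1} - \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^{k+1}\|^2 - \frac{\gamma}{2}\|x_0^{k+1} - x_0^k\|^2 \ge \frac{\gamma + N\rho}{2}\|x_0^{k+1} - x_0^k\|^2, \tag{A.5}$$

that is,

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k+1}) \le -\frac{2\gamma + N\rho}{2}\|x_0^{k+1} - x_0^k\|^2. \tag{A.6}$$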
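Secondly, it directly follows from (26) that

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k}) = \sum_{i=1}^N (\lambda_i^{k+1} - \lambda_i^k)^T (x_i^{k+1} - x_0^k)$$
$$= \sum_{i \in \mathcal{A}_k} (\lambda_i^{k+1} - \lambda_i^k)^T (x_i^{k+1} - x_0^{\bar{k}_i+1}) + \sum_{i \in \mathcal{A}_k} (\lambda_i^{k+1} - \lambda_i^k)^T (x_0^{\bar{k}_i+1} - x_0^k)$$
$$= \frac{1}{\rho}\sum_{i \in \mathcal{A}_k} \|\lambda_i^{k+1} - \lambda_i^k\|^2 + \sum_{i \in \mathcal{A}_k} (\lambda_i^{k+1} - \lambda_i^k)^T (x_0^{\bar{k}_i+1} - x_0^k), \tag{A.7}$$

where the second equality is due to the fact that $\lambda_i^{k+1} = \lambda_i^k$ $\forall i \in \mathcal{A}_k^c$, and the last equality is obtained by applying

$$\lambda_i^{k+1} = \lambda_i^k + \rho\,(x_i^{k+1} - x_0^{\bar{k}_i+1}) \quad \forall i \in \mathcal{A}_k, \tag{A.8}$$

as shown in (24).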

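Thirdly, define $\mathcal{L}_i(x_i, x_0^k, \lambda^k) = f_i(x_i) + x_i^T \lambda_i^k + \frac{\rho}{2}\|x_i - x_0^k\|^2$ and assume that $\rho > L$. Since, by [34, Lemma 1.2.2], the minimum eigenvalue of the Hessian matrix of $f_i(x_i)$ is no smaller than $-L$, $\mathcal{L}_i(x_i, x_0^k, \lambda^k)$ is strongly convex w.r.t. $x_i$ and the convexity parameter is given by $\rho - L > 0$ (see the footnote below). Therefore, one has

$$\mathcal{L}_i(x_i^k, x_0^k, \lambda^k) \ge \mathcal{L}_i(x_i^{k+1}, x_0^k, \lambda^k) + \bigl(\nabla f_i(x_i^{k+1}) + \lambda_i^k + \rho (x_i^{k+1} - x_0^k)\bigr)^T (x_i^k - x_i^{k+1}) + \frac{\rho - L}{2}\|x_i^k - x_i^{k+1}\|^2. \tag{A.9}$$

Also, by the optimality condition of (23), one has, $\forall i \in \mathcal{A}_k$,

$$0 = \nabla f_i(x_i^{k+1}) + \lambda_i^k + \rho\,(x_i^{k+1} - x_0^{\bar{k}_i+1}) \tag{A.10}$$
$$\;\; = \bigl(\nabla f_i(x_i^{k+1}) + \lambda_i^k + \rho\,(x_i^{k+1} - x_0^{k})\bigr) + \rho\,(x_0^{k} - x_0^{\bar{k}_i+1}). \tag{A.11}$$


Footnote: When $f_i$ is a convex function, the minimum eigenvalue of the Hessian matrix of $f_i(x_i)$ is no smaller than zero. So, the convexity parameter of $\mathcal{L}_i(x_i, x_0^k, \lambda^k)$ is $\rho$ instead.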
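By substituting (A.11) into (A.9) and by (26), we have

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k}, \lambda^{k}) - \mathcal{L}_\rho(x^{k}, x_0^{k}, \lambda^{k}) = \sum_{i=1}^N \bigl(\mathcal{L}_i(x_i^{k+1}, x_0^k, \lambda^k) - \mathcal{L}_i(x_i^{k}, x_0^k, \lambda^k)\bigr) = \sum_{i \in \mathcal{A}_k} \bigl(\mathcal{L}_i(x_i^{k+1}, x_0^k, \lambda^k) - \mathcal{L}_i(x_i^{k}, x_0^k, \lambda^k)\bigr)$$
$$\le -\frac{\rho - L}{2}\sum_{i \in \mathcal{A}_k} \|x_i^{k+1} - x_i^k\|^2 + \rho \sum_{i \in \mathcal{A}_k} (x_0^{\bar{k}_i+1} - x_0^k)^T (x_i^{k+1} - x_i^k), \tag{A.12}$$

where the second equality is due to $x_i^{k+1} = x_i^k$ $\forall i \in \mathcal{A}_k^c$ from (23).

After substituting (A.6), (A.7) and (A.12) into (A.1), we obtain

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) - \mathcal{L}_\rho(x^{k}, x_0^{k}, \lambda^{k}) \le -\frac{2\gamma + N\rho}{2}\|x_0^{k+1} - x_0^k\|^2 + \frac{1}{\rho}\sum_{i \in \mathcal{A}_k} \|\lambda_i^{k+1} - \lambda_i^k\|^2 - \frac{\rho - L}{2}\sum_{i \in \mathcal{A}_k} \|x_i^{k+1} - x_i^k\|^2$$
$$\quad + \sum_{i \in \mathcal{A}_k} (\lambda_i^{k+1} - \lambda_i^k)^T (x_0^{\bar{k}_i+1} - x_0^k) + \rho \sum_{i \in \mathcal{A}_k} (x_0^{\bar{k}_i+1} - x_0^k)^T (x_i^{k+1} - x_i^k). \tag{A.13}$$

Recall Young's inequality, i.e.,

$$a^T b \le \frac{1}{2\delta}\|a\|^2 + \frac{\delta}{2}\|b\|^2, \tag{A.14}$$

for any $a$, $b$ and $\delta > 0$, and apply it to the fourth and fifth terms on the RHS of (A.13) with $\delta = 1$ and $\delta = 1/\rho$, respectively. Then (27) is obtained. ■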
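As a worked illustration of this last step, the two applications of Young's inequality can be written out explicitly. The sketch below assumes that the fourth and fifth terms of (A.13) are the two cross terms $\sum_{i \in \mathcal{A}_k} (\lambda_i^{k+1} - \lambda_i^k)^T (x_0^{\bar{k}_i+1} - x_0^k)$ and $\rho \sum_{i \in \mathcal{A}_k} (x_0^{\bar{k}_i+1} - x_0^k)^T (x_i^{k+1} - x_i^k)$; it is meant as a reading aid rather than a restatement of (27).

```latex
\begin{align*}
(\lambda_i^{k+1} - \lambda_i^k)^T (x_0^{\bar{k}_i+1} - x_0^k)
  &\le \tfrac{1}{2}\,\|\lambda_i^{k+1} - \lambda_i^k\|^2
     + \tfrac{1}{2}\,\|x_0^{\bar{k}_i+1} - x_0^k\|^2
  && (\delta = 1), \\
\rho\,(x_0^{\bar{k}_i+1} - x_0^k)^T (x_i^{k+1} - x_i^k)
  &\le \tfrac{\rho^2}{2}\,\|x_0^{\bar{k}_i+1} - x_0^k\|^2
     + \tfrac{1}{2}\,\|x_i^{k+1} - x_i^k\|^2
  && (\delta = 1/\rho).
\end{align*}
% Adding the two bounds, the stale-variable term
% \|x_0^{\bar{k}_i+1} - x_0^k\|^2 carries the coefficient (1 + \rho^2)/2,
% which is the factor that Lemma 2 later multiplies by S(\tau - 1)^2.
```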
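Appendix B
Proof of Lemma 2


It is easy to show that

$$\sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \|x_0^{\bar{j}_i+1} - x_0^{j}\|^2 = \sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \Bigl\| \sum_{\ell=\bar{j}_i+1}^{j-1} (x_0^{\ell} - x_0^{\ell+1}) \Bigr\|^2$$
$$\le \sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} (j - \bar{j}_i - 1) \sum_{\ell=\bar{j}_i+1}^{j-1} \|x_0^{\ell} - x_0^{\ell+1}\|^2$$
$$\le (\tau - 1) \sum_{j=0}^{k} \sum_{i \in \mathcal{A}_j} \sum_{\ell=j-\tau+1}^{j-1} \|x_0^{\ell} - x_0^{\ell+1}\|^2$$
$$\le S(\tau - 1) \sum_{j=0}^{k} \sum_{\ell=j-\tau+1}^{j-1} \|x_0^{\ell} - x_0^{\ell+1}\|^2, \tag{A.15}$$

where, in the second inequality, we have applied the fact that $j - \tau \le \bar{j}_i < j$ from (21); in the last inequality, we have applied the assumption that $|\mathcal{A}_k| \le S$ for all $k$. Notice that, in the summation $\sum_{j=0}^{k} \sum_{\ell=j-\tau+1}^{j-1} \|x_0^{\ell} - x_0^{\ell+1}\|^2$, each $\|x_0^{j} - x_0^{j+1}\|^2$, where $j = 0, \ldots, k-1$, appears no more than $\tau - 1$ times. Thus, one can upper bound

$$\sum_{j=0}^{k} \sum_{\ell=j-\tau+1}^{j-1} \|x_0^{\ell} - x_0^{\ell+1}\|^2 \le (\tau - 1) \sum_{j=0}^{k-1} \|x_0^{j} - x_0^{j+1}\|^2, \tag{A.16}$$

which, combined with (A.15), yields (32). ■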
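The counting argument behind Lemma 2 can be sanity-checked numerically. The sketch below draws an arbitrary sequence $x_0^0, \ldots, x_0^{k+1}$ and random delays satisfying $j - \tau \le \bar{j}_i < j$ with at most $S$ arrivals per iteration, then compares both sides of the bound $\sum_j \sum_{i \in \mathcal{A}_j} \|x_0^{\bar{j}_i+1} - x_0^j\|^2 \le S(\tau-1)^2 \sum_j \|x_0^{j+1} - x_0^j\|^2$. The function name `lemma2_sides` is hypothetical, and iterates with a negative index are assumed to equal $x_0^0$.

```python
import numpy as np


def lemma2_sides(k=200, n=5, tau=4, S=3, seed=0):
    """Return (lhs, rhs) of the Lemma 2 bound for one random instance."""
    rng = np.random.default_rng(seed)
    # Random walk playing the role of x_0^0, ..., x_0^{k+1}.
    x0 = np.cumsum(rng.standard_normal((k + 2, n)), axis=0)
    lhs = 0.0
    for j in range(k + 1):
        for _ in range(rng.integers(1, S + 1)):   # |A_j| <= S arrived workers
            jbar = rng.integers(j - tau, j)        # j - tau <= jbar < j
            stale = x0[max(jbar + 1, 0)]           # x_0^{jbar+1}, clipped at index 0
            lhs += np.sum((stale - x0[j]) ** 2)
    # Sum of squared successive differences for j = 0, ..., k-1.
    steps = np.sum((x0[1:k + 1] - x0[:k]) ** 2)
    rhs = S * (tau - 1) ** 2 * steps
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = lemma2_sides()
    print(f"lhs = {lhs:.1f}, rhs = {rhs:.1f}, bound holds: {lhs <= rhs}")
```

In the synchronous case $\tau = 1$ both sides are zero, since every arrived worker used the most recent $x_0$.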


Appendix C 
Proof of Lemma 3 

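The proof is similar to [18, Lemma 2.3]. We present the proof here for completeness. By recalling equation (29) and applying it to (26), one obtains

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) = h(x_0^{k+1}) + \sum_{i=1}^N f_i(x_i^{k+1}) + \sum_{i=1}^N \bigl(\nabla f_i(x_i^{k+1})\bigr)^T (x_0^{k+1} - x_i^{k+1}) + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^{k+1}\|^2. \tag{A.17}$$

As $\nabla f_i$ is Lipschitz continuous under Assumption 2, the descent lemma [36, Proposition A.24] holds:

$$f_i(x_0^{k+1}) \le f_i(x_i^{k+1}) + \bigl(\nabla f_i(x_i^{k+1})\bigr)^T (x_0^{k+1} - x_i^{k+1}) + \frac{L}{2}\|x_i^{k+1} - x_0^{k+1}\|^2 \quad \forall i \in \mathcal{V}. \tag{A.18}$$

By combining (A.17) and (A.18), one can lower bound $\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1})$ as

$$\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1}) \ge h(x_0^{k+1}) + \sum_{i=1}^N f_i(x_0^{k+1}) + \frac{\rho - L}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0^{k+1}\|^2, \tag{A.19}$$

which implies (33) given $\rho \ge L$ and under Assumption 2. ■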


Appendix D 
Proof of Theorem 2 

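For ease of analysis, we equivalently write Algorithm 4 as follows: for iteration $k = 0, 1, \ldots$,

$$x_i^{k+1} = \begin{cases} \arg\min_{x_i} \; f_i(x_i) + x_i^T \lambda_i^{\bar{k}_i+1} + \frac{\rho}{2}\|x_i - x_0^{\bar{k}_i+1}\|^2, & \forall i \in \mathcal{A}_k, \\ x_i^{k}, & \forall i \in \mathcal{A}_k^c, \end{cases} \tag{A.20}$$

$$x_0^{k+1} = \arg\min_{x_0} \; h(x_0) - x_0^T \sum_{i=1}^N \lambda_i^{k} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2, \tag{A.21}$$

$$\lambda_i^{k+1} = \lambda_i^{k} + \rho\,(x_i^{k+1} - x_0^{k+1}) \quad \forall i \in \mathcal{V}. \tag{A.22}$$

Here, $\bar{k}_i$ is the last iteration number for which the master node receives a message from worker $i \in \mathcal{A}_k$ before iteration $k$. For $i \in \mathcal{A}_k^c$, let us denote by $\tilde{k}_i$ ($k - \tau \le \tilde{k}_i < k$) the last iteration number for which the master node receives a message from worker $i$ before iteration $k$, and further denote by $\hat{k}_i$ ($\tilde{k}_i - \tau \le \hat{k}_i < \tilde{k}_i$) the last iteration number for which the master node receives a message from worker $i$ before iteration $\tilde{k}_i$. Then, by (A.20), it must be that

$$x_i^{\tilde{k}_i+1} = \arg\min_{x_i} \; f_i(x_i) + x_i^T \lambda_i^{\hat{k}_i+1} + \frac{\rho}{2}\|x_i - x_0^{\hat{k}_i+1}\|^2 \quad \forall i \in \mathcal{A}_k^c, \tag{A.23}$$

$$x_i^{k+1} = x_i^{\tilde{k}_i+1} \quad \forall i \in \mathcal{A}_k^c, \tag{A.24}$$

where the second equation is due to $x_i^{\tilde{k}_i+1} = x_i^{\tilde{k}_i+2} = \cdots = x_i^{k} = x_i^{k+1}$ $\forall i \in \mathcal{A}_k^c$.

Let us consider the following update steps

$$x_i^{k+1} = \begin{cases} \arg\min_{x_i} \; \alpha f_i(x_i) + x_i^T \lambda_i^{\bar{k}_i+1} + \frac{\beta}{2}\|x_i - x_0^{\bar{k}_i+1}\|^2, & \forall i \in \mathcal{A}_k, \\ \arg\min_{x_i} \; \alpha f_i(x_i) + x_i^T \lambda_i^{\hat{k}_i+1} + \frac{\beta}{2}\|x_i - x_0^{\hat{k}_i+1}\|^2, & \forall i \in \mathcal{A}_k^c, \end{cases} \tag{A.25}$$

$$x_0^{k+1} = \arg\min_{x_0} \; \alpha h(x_0) - x_0^T \sum_{i=1}^N \lambda_i^{k} + \frac{\beta}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2, \tag{A.26}$$

$$\lambda_i^{k+1} = \lambda_i^{k} + \beta\,(x_i^{k+1} - x_0^{k+1}) \quad \forall i \in \mathcal{V}, \tag{A.27}$$

where $\alpha, \beta > 0$. One can verify that (A.25)-(A.27) are equivalent to (A.20)-(A.22) and (A.23)-(A.24) if one considers the change of variables $\lambda_i^k \leftarrow \lambda_i^k / \alpha$ and $\rho = \beta / \alpha$.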
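We first consider the optimality condition of (A.25) for $i \in \mathcal{A}_k$:

$$0 \ge \alpha \nabla f_i(x_i^{k+1})^T (x_i^{k+1} - x_i^*) + \bigl(\lambda_i^{\bar{k}_i+1} + \beta (x_i^{k+1} - x_0^{\bar{k}_i+1})\bigr)^T (x_i^{k+1} - x_i^*)$$
$$= \alpha \nabla f_i(x_i^{k+1})^T (x_i^{k+1} - x_i^*) + (\lambda_i^{k+1})^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\bar{k}_i+1} - \lambda_i^{k})^T (x_i^{k+1} - x_i^*) + \beta (x_0^{k+1} - x_0^{\bar{k}_i+1})^T (x_i^{k+1} - x_i^*), \tag{A.28}$$

where we have applied (A.27) to obtain the equality. Since, under Assumption 3, $f_i$ is strongly convex, one has

$$\alpha f_i(x_i^*) \ge \alpha f_i(x_i^{k+1}) + \alpha \nabla f_i(x_i^{k+1})^T (x_i^* - x_i^{k+1}) + \frac{\alpha \sigma^2}{2}\|x_i^{k+1} - x_i^*\|^2. \tag{A.29}$$

Combining (A.28) and (A.29) gives rise to

$$\alpha f_i(x_i^{k+1}) - \alpha f_i(x_i^*) + \lambda_i^T (x_i^{k+1} - x_i^*) + \frac{\alpha \sigma^2}{2}\|x_i^{k+1} - x_i^*\|^2 + (\lambda_i^{k+1} - \lambda_i)^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\bar{k}_i+1} - \lambda_i^{k})^T (x_i^{k+1} - x_i^*)$$
$$\quad + \beta (x_0^{k+1} - x_0^{\bar{k}_i+1})^T (x_i^{k+1} - x_i^*) \le 0 \quad \forall i \in \mathcal{A}_k, \tag{A.30}$$

where $\lambda_i$ is an arbitrary vector. On the other hand, consider the optimality condition of (A.25) for $i \in \mathcal{A}_k^c$:

$$0 \ge \alpha \nabla f_i(x_i^{k+1})^T (x_i^{k+1} - x_i^*) + \bigl(\lambda_i^{\hat{k}_i+1} + \beta (x_i^{k+1} - x_0^{\hat{k}_i+1})\bigr)^T (x_i^{k+1} - x_i^*)$$
$$= \alpha \nabla f_i(x_i^{k+1})^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\hat{k}_i+1} + \lambda_i^{\tilde{k}_i+1} - \lambda_i^{\tilde{k}_i})^T (x_i^{k+1} - x_i^*) + \beta (x_0^{\tilde{k}_i+1} - x_0^{\hat{k}_i+1})^T (x_i^{k+1} - x_i^*)$$
$$= \alpha \nabla f_i(x_i^{k+1})^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\tilde{k}_i+1})^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\hat{k}_i+1} - \lambda_i^{\tilde{k}_i})^T (x_i^{k+1} - x_i^*) + \beta (x_0^{\tilde{k}_i+1} - x_0^{\hat{k}_i+1})^T (x_i^{k+1} - x_i^*), \tag{A.31}$$

where (A.27) with $k = \tilde{k}_i$ and (A.24) are used to obtain the first equality. By combining (A.29) with (A.31), one obtains

$$\alpha f_i(x_i^{k+1}) - \alpha f_i(x_i^*) + \lambda_i^T (x_i^{k+1} - x_i^*) + \frac{\alpha \sigma^2}{2}\|x_i^{k+1} - x_i^*\|^2 + (\lambda_i^{\tilde{k}_i+1} - \lambda_i)^T (x_i^{k+1} - x_i^*) + (\lambda_i^{\hat{k}_i+1} - \lambda_i^{\tilde{k}_i})^T (x_i^{k+1} - x_i^*)$$
$$\quad + \beta (x_0^{\tilde{k}_i+1} - x_0^{\hat{k}_i+1})^T (x_i^{k+1} - x_i^*) \le 0 \quad \forall i \in \mathcal{A}_k^c. \tag{A.32}$$

By summing (A.30) for all $i \in \mathcal{A}_k$ and (A.32) for all $i \in \mathcal{A}_k^c$, and further summing the resultant two terms, we obtain that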

N N N 

« fiix*) +Y^ \J -x*) + ^'^\\x\ 


N 9 

aa 


2=1 


2=1 


2 = 1 


2 = 1 


^+1 _ 

2 


+ E(^: 


‘+‘ - - X 




) + E ‘ - ^.)^( 

i<^Ai 




(a) 


+ E (E""' - - <) + E (E""' - a*-)"’(4+" - * 


ki+l _ \k\T / k+l _ * 

2 '^2 / V*^2 *^2 

i^jA.k 2G^^ 

k-\-l ki+l\T / k-\-l * 


+ Y1 

2G^fc 


Xd' -x„ 


(x‘+>-x*)+ E/5( 

ie.4g 


A^iH-l ^ki-\-\'\^ / k-\-\ 


®o - ®0 


(®f+^ - O < 0. (A.33) 


(b) 


The term (a) in (A.33), after adding and subtracting Yhi^A'^ ~ )> can be written as 


\Tt^k+l 


N 


(o) = ^(A‘+1 - Ai)^(x‘+‘ - X*) + ^ (A*-+> - A‘+‘)’-(x‘+i - X*). 

i€Ai 


(A.34) 


2=1 


The term (b) in (A.33) can be expressed as 

(b) = 53 /J(xS+‘ - xj + x‘ - x*-+>)^(xf+> - X*) + E 


— X, 


iSAk 

N 


i&At 


J2 /3(*S+' - xS)’'(x‘+‘ - X*) + E - xJ-+‘ - x‘+‘ + xS)^(x‘+‘ - X*) 


2=1 

+ Y, I3(^« - *o‘+‘)^(x‘+‘ - X*). 

iG,A.k 

Note that, by applying (A.27) and the fact of x* = x^Mi ^ V, one can write 

N N 

Y, /3(4+i - -x:) = Y - 4^' + 4^ 


(A.35) 


— X- 


2=1 


2=1 

N 


= y;(xS+> - xJ)^(A*+‘ - A?) + JV/3(xS+> - xJ)^(xS+‘ - xj). 

(A.36) 


2=1 


So, The term (b) in (A.35) is given by 


N 


(b) = y3(x‘+‘ - x‘)^(A‘+i - A?) + JV^(xS+i - xJ)’-(xS+i - xS) 


i=l 


+ f((xS-+‘ - xS-+‘ - xS+> + x*)^(x‘+> - x*) + ^(x* - xJ-+‘)’-(x‘+‘ - X*). 

2G.A^ 2G^fc 

(A.37) 
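It can be shown that

$$\sum_{i=1}^N (x_0^{k+1} - x_0^{k})^T (\lambda_i^{k+1} - \lambda_i^{k}) \ge 0. \tag{A.38}$$

To see this, consider the optimality condition of (A.26): $\forall x_0 \in \mathbb{R}^n$,

$$0 \ge \alpha h(x_0^{k+1}) - \alpha h(x_0) - \sum_{i=1}^N \bigl(\lambda_i^{k} + \beta (x_i^{k+1} - x_0^{k+1})\bigr)^T (x_0^{k+1} - x_0)$$
$$= \alpha h(x_0^{k+1}) - \alpha h(x_0) - \sum_{i=1}^N (\lambda_i^{k+1})^T (x_0^{k+1} - x_0), \tag{A.39}$$

where the equality is due to (A.27). By letting $x_0 = x_0^{k}$ in (A.39), and also considering (A.39) at the previous iteration with $x_0 = x_0^{k+1}$, we have

$$0 \ge \alpha h(x_0^{k+1}) - \alpha h(x_0^{k}) - \sum_{i=1}^N (\lambda_i^{k+1})^T (x_0^{k+1} - x_0^{k}),$$
$$0 \ge \alpha h(x_0^{k}) - \alpha h(x_0^{k+1}) - \sum_{i=1}^N (\lambda_i^{k})^T (x_0^{k} - x_0^{k+1}), \tag{A.40}$$

respectively. By summing the above two inequalities, we obtain (A.38). Moreover, by letting $x_0 = x_0^*$ in (A.39), we have

$$\alpha h(x_0^{k+1}) - \alpha h(x_0^*) - \sum_{i=1}^N \lambda_i^T (x_0^{k+1} - x_0^*) - \sum_{i=1}^N (\lambda_i^{k+1} - \lambda_i)^T (x_0^{k+1} - x_0^*) \le 0. \tag{A.41}$$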

By summing (A.41) and (A.33) followed by applying (A.34), (A.37) and (A.38), one obtains 


N 


N 


N 


a 


+ ah{x 


fc+i 'i 


N o 
acj 


— a 


^ /i(4) - a^(4) + XI ~ + X 


*I|2 


2 = 1 


2=1 


2=1 


2 = 1 


N 


+ 5 - A*) + iV/)(xJ+‘ - x*)^(4+‘ - xj) 


i=l 


+ y; (Af'+‘ - A‘+>+A*-+‘ - aS.)’'(x‘+' - + E 




2G^fc 


+ y; /3(xS'+‘ - x5-+‘ - xj+>+xS)^(xy - x*) + /3(xS - xS-+‘)’’(xyi - x*) < o, 

(A.42) 


where the seventh term in the LHS is obtained by applying (A.27). 
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We sum (A.42) for A: = 0,..., if — 1 and take the average, which yields 

K-l r N r N 1 K-l N ^ 


k=0 i=l 


i=l 


— X 


fc+l\ 


fc=0 i=l 



K-l N 2 

- X 2 " * *" 

k=0 i=l 


1 


K-l 


+ k1^[-i1 - E - y 

k=o ^ ieAt ieAk 


(c) 


K-l 


+ :^ E ( - E - y+4f(y - o - E 


k ^ki+l\T ( k+1 * 


Xq - Xq 


k=0 ^ i&At 


iGAk 


(d) 


It is easy to see that term (a) 


(^) = ^ E (- ^*11' -11^' - ^*11' + 


K-l 


k=0 


= 2||Af-AdP- 


k\\2 
i II 


K-l 


2" * 

and similarly, term (h) 

(b) = dE 


-||A°-Aif+ - E 11^: 


k+1 \fc||2 


2 -* 
k=0 


''ill 5 


K-l 


k=0 


\^k-\-l ^★||2 11^^ ^*11^ -1- ll^^d”^ ^^11^ 

1^-0 J^oll ll-^o -^oll d" ll-^o '^oll 


— ^ ll^-f^ _ ^*||2 _ ^ 11^0 _ ^*||2 

— 22 '2 


K-l 


1 

+ 9 E 


„k+l h\\2 


k=0 


X. - X, 


(A.43) 


(A.44) 


(A.45) 
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Notice that one can bound the term ~ "^t ~ **) ttt (c) as follows 

E E - “=n = E E E 


k/ci + l 


fc—0 2€v4.fc 

K-1 k-1 

SEE E 

k=0 iGAk £=k—T+l 

N K-1 k-1 

< 


k=0 iGAk e.=ki+l 


\x^^^-x1\ 


EE E (4l|X'-A«P + ^||atr-tn*|p) 

i=l k=0 £=k-r+l ^ ^ ^ 


< 


N K-1 

EE 


T — 1 , 


I \ k-\-l \ /e 11 ^ I 

2^2 --^*11 + 


2 , (r-l)^.,k+i 


x---x*f 


i=l k=0 

where the second inequality is obtained hy applying the Young’s inequality: 

a^b< ^||af+ ^||6||2 


(A.46) 

(A.47) 

(A.48) 


for any a, b and 5 > 0; the last inequality is caused hy the fact that the term — A^|p for each 

k does not appear more than r — 1 times in the RHS of (A.46). By applying a similar idea to the first 
term of (c) and hy (A.47), one eventually can hound (c) as follows 


O / .. N N K— 1 , 1 \ o2 ^ K— 1 

(■=) < E E iia 7‘ - E E114+' - <i7 


2/32 


Similarly, the term Y.k=o I5 {xq — Xq'^^Y ~ ^i) iri (d) can he upper hounded as follows 

E E -*.*)< E E E ' 

k=0 iGAk k=0 iGAk £=k—T+l 


(A.49) 


i=l k=0 


i=l k=0 


„ki+l\T f^k+l 


„ki+l I 


- xi\ 


< 


< 


N K-1 k-1 , 

EE E il 

i=l k=0 £=k—T+l 

N K-1 


^k-\-l _ ^k[[2 _i_ ii^/c+1 _ „*||2 

211 U Ull ' 2 ^ ^ 


EE 


T — 1 , 


^^+1 _ 11^ -U 

*^0 *^nll ' 


2 , ll^fc+1 ^*l|2 


X, — X, 


i=l k=0 

By applying a similar idea to the first term of (d) and hy (A.51), one can hound (d) as follows 

N K-1 

o‘2\\^k-\-\ ^★||2 


2=1 fc=0 


(■J) s E E - *SiP+ t/3"ii4+‘ - 41 


(A.50) 

(A.51) 

(A.52) 
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After substituting (A.44), (A.45), (A.49) and (A.52) into (A.43), we obtain that 

N r N 1 N 


a 




0 J 


^ i=l 


— a 




i=l 


2=1 


s|E E«4 +‘)+m^, 


K-l r N 


fc + lN 


< 


1 


2/3 a: 


fc=0 i=l 
N 

EII^S 


— a 


r N 1 K-l N ^ 

M^i) + h{xQ) + X X 

- i=l fc=0 i=l 


r(x‘+‘-xj+') 


Af 


- A,||^- 


X 11'^^ “ + ^ll®n - ®nlP - 


iV/3, 


+ 


2=1 

3(r - 1) 

2KP^ 2pK 


2/3Ar^"'’* ' 2K" ° 2K 

i=l 


N K-l 


i=l k=0 

K-l N 


K-l 


fc+1 ^k\\2 


Xq' -Xn 


yEEi«"‘-s?ifH^-i)E 

3(r-l)/3^ + 2r/3^-a^^Yi^fc+i ^*||2 


4ee 


K 

k=0 i=l 

where the first inequality is by the convexity of /j’s and h. 
According to (A.53), by choosing 

/3 > max{2r, 3(r — 1)}, a > 


x; ' — X, 


(5r — 3)/3^ 


cr^ 


and recalling that Xi = \i/a and p = P/a, one can obtain 

■ N I r ^ 1 ^ 

'^Mxf) + h{x^) - X/*(®?)+ ^(®o) -x^ 


0 J 


i=l 


i=l 


i=l 


N 


< V IIAO - A-I|2 + ^\\t° - T 


i=l 


*||2 

oil ■ 


Note that (A.54) is equivalent to 

p = P/a < 


a 


< 


a 


(5r — 3)/3 (5r — 3) max{2r, 3(r — 1)} 

Vi G V in (A.57), and note that, by the duality theory [37], 
1 r ^ 1 Af 

+ h{x/,) +J2iKfixf-x^)>0. 


Now, let Aj — A7 + II-)(■ _II 


N 


2=1 

Thus, we obtain that 

N 


Eii*f-*fiis^ 


2=1 


— max 
2p ||a||<l 


2=1 


N 


2=1 


2=1 


X \ + ^ll®o - ^ 


r.*||2 

-oil 


A (A^57) 


(A.53) 


(A.54) 


(A.55) 


(A.56) 
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On the other hand, let A* = A* in (A.57), and note that, 

N -I r N 1 ^ 

+ h{x^) +'^{X*f{xf-x^) 




*- 2=1 


> 


'^fi{xf) + h{x, 

N 

^ fi{xf) + h{x, 


*- 2=1 


2=1 


2=1 


N 


2 = 1 


^ fi{x*) + h{x^) -6x^\\xf -x^\ 


N 


i=l 


where Ja = max{||A^||,..., ||A^||}. Thus, we obtain that 


N 


2=1 


^fiixf) + h{x^) - ^ fi{x*) + h{x^) 


N 


i=l 


< 


^xCl , ^ Y ^||\0 \*||2 , -^^ 11^0 ™*||2 _ '^ aC'i + (72 


+ 


K 2pK ^ 

2 = 1 

Finally, combining (A.57) and (A.59) gives rise to (49). ■


■I ~ “T 2_^ 11*0 ~ ~ 


K 


(A.58) 


(A.59) 
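Algorithm 2 Asynchronous Distributed ADMM for (4).

1: Algorithm of the Master:
2: Given an initial variable, broadcast it to the workers. Set $k = 0$ and $d_1 = \cdots = d_N = 0$;
3: repeat
4: wait until receiving $\{\hat{x}_i, \hat{\lambda}_i\}_{i \in \mathcal{A}_k}$ from workers $i \in \mathcal{A}_k$ such that $|\mathcal{A}_k| \ge A$ and $d_i < \tau - 1$ $\forall i \in \mathcal{A}_k^c$.
5: update

$$x_i^{k+1} = \begin{cases} \hat{x}_i & \forall i \in \mathcal{A}_k \\ x_i^{k} & \forall i \in \mathcal{A}_k^c \end{cases} \tag{9}$$

$$\lambda_i^{k+1} = \begin{cases} \hat{\lambda}_i & \forall i \in \mathcal{A}_k \\ \lambda_i^{k} & \forall i \in \mathcal{A}_k^c \end{cases} \tag{10}$$

$$d_i = \begin{cases} 0 & \forall i \in \mathcal{A}_k \\ d_i + 1 & \forall i \in \mathcal{A}_k^c \end{cases} \tag{11}$$

$$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \Bigl\{ h(x_0) - x_0^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2}\|x_0 - x_0^{k}\|^2 \Bigr\} \tag{12}$$

6: broadcast $x_0^{k+1}$ to the workers in $\mathcal{A}_k$.
7: set $k \leftarrow k + 1$.
8: until a predefined stopping criterion is satisfied.


1: Algorithm of the $i$th Worker:
2: Given initial values, set $k_i = 0$.
3: repeat
4: wait until receiving $\hat{x}_0$ from the master node.
5: update

$$x_i^{k_i+1} = \arg\min_{x_i \in \mathbb{R}^n} \; f_i(x_i) + x_i^T \lambda_i^{k_i} + \frac{\rho}{2}\|x_i - \hat{x}_0\|^2, \tag{13}$$

$$\lambda_i^{k_i+1} = \lambda_i^{k_i} + \rho\,(x_i^{k_i+1} - \hat{x}_0).$$

6: send $(x_i^{k_i+1}, \lambda_i^{k_i+1})$ to the master node.
7: set $k_i \leftarrow k_i + 1$.
8: until a predefined stopping criterion is satisfied.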
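Algorithm 3 Asynchronous distributed ADMM from the master's point of view.

1: Given initial variables $x^0$ and $\lambda^0$; set $x_0^0 = x^0$ and $k = 0$.
2: repeat
3: update

$$x_i^{k+1} = \begin{cases} \arg\min_{x_i \in \mathbb{R}^n} \Bigl\{ f_i(x_i) + x_i^T \lambda_i^{k} + \frac{\rho}{2}\|x_i - x_0^{\bar{k}_i+1}\|^2 \Bigr\}, & \forall i \in \mathcal{A}_k \\ x_i^{k}, & \forall i \in \mathcal{A}_k^c \end{cases} \tag{23}$$

$$\lambda_i^{k+1} = \begin{cases} \lambda_i^{k} + \rho\,(x_i^{k+1} - x_0^{\bar{k}_i+1}), & \forall i \in \mathcal{A}_k \\ \lambda_i^{k}, & \forall i \in \mathcal{A}_k^c \end{cases} \tag{24}$$

$$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \Bigl\{ h(x_0) - x_0^T \sum_{i=1}^N \lambda_i^{k+1} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2}\|x_0 - x_0^{k}\|^2 \Bigr\} \tag{25}$$

4: set $k \leftarrow k + 1$.
5: until a predefined stopping criterion is satisfied.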
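To make the updates (23)-(25) concrete, here is a minimal single-process sketch of the master's-point-of-view iteration for the LASSO instance $f_i(x) = \|A_i x - b_i\|^2$, $h(x) = \theta\|x\|_1$. The function names, the closed-form worker solve, the soft-thresholding step for (25), and the Bernoulli arrival model with forced arrivals after $\tau - 1$ absences are choices made for this illustration; it is not the code used for the reported experiments.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ad_admm_lasso(A, b, theta, rho, tau, probs, iters, gamma=0.0, seed=0):
    """Simulate (23)-(25) for f_i(x) = ||A_i x - b_i||^2, h(x) = theta*||x||_1."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    N, n = len(A), A[0].shape[1]
    x = np.zeros((N, n))            # worker copies x_i
    lam = np.zeros((N, n))          # dual variables lambda_i
    x0 = np.zeros(n)                # master variable x_0
    x0_seen = np.zeros((N, n))      # last x_0 broadcast to each worker (stale copy)
    d = np.zeros(N, dtype=int)      # delay counters d_i
    H = [2.0 * Ai.T @ Ai + rho * np.eye(n) for Ai in A]
    Atb = [2.0 * Ai.T @ bi for Ai, bi in zip(A, b)]
    for _ in range(iters):
        arrived = (d >= tau - 1) | (rng.random(N) < probs)
        while not arrived.any():    # the master waits for at least one worker
            arrived = rng.random(N) < probs
        for i in np.flatnonzero(arrived):
            # (23): minimize f_i(x) + x^T lam_i + rho/2 ||x - x0_seen_i||^2
            x[i] = np.linalg.solve(H[i], Atb[i] - lam[i] + rho * x0_seen[i])
            # (24): dual update with the same stale x_0
            lam[i] = lam[i] + rho * (x[i] - x0_seen[i])
        d = np.where(arrived, 0, d + 1)
        # (25): proximal step; the quadratic terms are merged into one center v
        v = (lam.sum(axis=0) + rho * x.sum(axis=0) + gamma * x0) / (N * rho + gamma)
        x0 = soft_threshold(v, theta / (N * rho + gamma))
        x0_seen[arrived] = x0       # broadcast only to the arrived workers
    return x, x0, lam


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, m, n = 4, 30, 10
    A = [rng.standard_normal((m, n)) for _ in range(N)]
    w0 = np.zeros(n)
    w0[:2] = [1.0, -2.0]
    b = [Ai @ w0 + 0.1 * rng.standard_normal(m) for Ai in A]
    x, x0, lam = ad_admm_lasso(A, b, 0.1, 50.0, tau=3,
                               probs=[0.1, 0.1, 0.8, 0.8], iters=2000)
    print("consensus residual:", np.max(np.abs(x - x0)))
```

Setting `tau=1` forces every worker to arrive at each iteration, which recovers the synchronous protocol. By construction, each updated worker satisfies $\nabla f_i(x_i^{k+1}) + \lambda_i^{k+1} = 0$, which is the identity (29) used throughout the analysis.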
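Algorithm 4 An Alternative Implementation of Asynchronous Distributed ADMM.

1: Algorithm of the Master:
2: Given an initial variable, broadcast it to the workers. Set $k = 0$ and $d_1 = \cdots = d_N = 0$;
3: repeat
4: wait until receiving $\{\hat{x}_i\}_{i \in \mathcal{A}_k}$ from workers $i \in \mathcal{A}_k$ such that $|\mathcal{A}_k| \ge A$ and $d_i < \tau - 1$ $\forall i \in \mathcal{A}_k^c$.
5: update

$$x_i^{k+1} = \begin{cases} \hat{x}_i & \forall i \in \mathcal{A}_k \\ x_i^{k} & \forall i \in \mathcal{A}_k^c \end{cases} \tag{44}$$

$$d_i = \begin{cases} 0 & \forall i \in \mathcal{A}_k \\ d_i + 1 & \forall i \in \mathcal{A}_k^c \end{cases}$$

$$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \Bigl\{ h(x_0) - x_0^T \sum_{i=1}^N \lambda_i^{k} + \frac{\rho}{2}\sum_{i=1}^N \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2}\|x_0 - x_0^{k}\|^2 \Bigr\} \tag{45}$$

$$\lambda_i^{k+1} = \lambda_i^{k} + \rho\,(x_i^{k+1} - x_0^{k+1}) \quad \forall i \in \mathcal{V}. \tag{46}$$

6: broadcast $x_0^{k+1}$ and $\lambda_i^{k+1}$ to the workers in $\mathcal{A}_k$.
7: set $k \leftarrow k + 1$.
8: until a predefined stopping criterion is satisfied.


1: Algorithm of the $i$th Worker:
2: Given initial values, set $k_i = 0$.
3: repeat
4: wait until receiving $(\hat{x}_0, \hat{\lambda}_i)$ from the master node.
5: update

$$x_i^{k_i+1} = \arg\min_{x_i \in \mathbb{R}^n} \; f_i(x_i) + x_i^T \hat{\lambda}_i + \frac{\rho}{2}\|x_i - \hat{x}_0\|^2, \tag{47}$$

6: send $\hat{x}_i = x_i^{k_i+1}$ to the master node.
7: set $k_i \leftarrow k_i + 1$.
8: until a predefined stopping criterion is satisfied.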